--- title: "Final Group Project: AirBnB Listings Cape_Town" author: "Group 14" date: "2020-09-20" output: html_document: theme: flatly highlight: zenburn number_sections: yes toc: yes toc_float: yes code_folding: show ---

Executive Summary

This report explores a dataset coming from AirBnB containing information on property listings in Cape Town, South Africa. The aim of this reports is to establish a multiple regression model capable of predicting the price for 2 people staying for 4 nights based on a series of factors.

We will begin by exploring the data itself to understand its composition and underlying factors. Next, we cleaned and improved the data making it more manageable by simplifying certain categorical variables and dealing with missing values in the variables we deemed useful. Similarly, we removed variables we considered as irrelevant for the purposes of the model.

We also created a map with geo taggings of price, property type and attached url and ran a series of visualizations to gain an understanding of the variables included in the dataset before testing and running regressions.

Finally we conducted a series of regression tests on varying factors as well as applying a series of transofrmations in order to find the best model possible, before finally predicting the price for 4 nights for 2 people based on a series of predetermined inputs. Diagnostics have been conducted throughout the regression section as we laid out mobel by model. After coming up with a final model we left some suggestions for improvement and further analysis.

Data Preparation

We start by loading packages.

Exploratory Data Analysis

Load the dataset

First, we will load the data set from an URL provided by our professor.

listings <- vroom("http://data.insideairbnb.com/south-africa/wc/cape-town/2020-06-21/data/listings.csv.gz") %>% 
    clean_names()

Raw values

To get an understanding of the data, we will look at the raw values with the glimpse function.

dplyr::glimpse(listings)
## Rows: 24,062
## Columns: 106
## $ id                                           <dbl> 3191, 15007, 15068, 1507…
## $ listing_url                                  <chr> "https://www.airbnb.com/…
## $ scrape_id                                    <dbl> 2.02e+13, 2.02e+13, 2.02…
## $ last_scraped                                 <date> 2020-06-23, 2020-06-24,…
## $ name                                         <chr> "Malleson Garden Cottage…
## $ summary                                      <chr> "This is a lovely, separ…
## $ space                                        <chr> "Beautiful self catering…
## $ description                                  <chr> "This is a lovely, separ…
## $ experiences_offered                          <chr> "none", "none", "none", …
## $ neighborhood_overview                        <chr> "Mowbray is on the South…
## $ notes                                        <chr> "There are pets on the p…
## $ transit                                      <chr> "The train station is ab…
## $ access                                       <chr> "Fully self-contained co…
## $ interaction                                  <chr> "I live in the main hous…
## $ house_rules                                  <chr> "There is no smoking in …
## $ thumbnail_url                                <lgl> NA, NA, NA, NA, NA, NA, …
## $ medium_url                                   <lgl> NA, NA, NA, NA, NA, NA, …
## $ picture_url                                  <chr> "https://a0.muscache.com…
## $ xl_picture_url                               <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_id                                      <dbl> 3754, 59072, 59318, 5934…
## $ host_url                                     <chr> "https://www.airbnb.com/…
## $ host_name                                    <chr> "Brigitte", "Dirk", "Lin…
## $ host_since                                   <date> 2008-10-21, 2009-12-01,…
## $ host_location                                <chr> "Cape Town, Western Cape…
## $ host_about                                   <chr> "I'm single and love to …
## $ host_response_time                           <chr> "N/A", "within a day", "…
## $ host_response_rate                           <chr> "N/A", "100%", "N/A", "1…
## $ host_acceptance_rate                         <chr> "100%", "100%", "33%", "…
## $ host_is_superhost                            <lgl> TRUE, TRUE, FALSE, FALSE…
## $ host_thumbnail_url                           <chr> "https://a0.muscache.com…
## $ host_picture_url                             <chr> "https://a0.muscache.com…
## $ host_neighbourhood                           <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_listings_count                          <dbl> 1, 11, 13, 5, 1, 1, 3, 1…
## $ host_total_listings_count                    <dbl> 1, 11, 13, 5, 1, 1, 3, 1…
## $ host_verifications                           <chr> "['email', 'phone', 'rev…
## $ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ host_identity_verified                       <lgl> TRUE, TRUE, FALSE, TRUE,…
## $ street                                       <chr> "Southern Suburbs, Weste…
## $ neighbourhood                                <lgl> NA, NA, NA, NA, NA, NA, …
## $ neighbourhood_cleansed                       <chr> "Ward 57", "Ward 23", "W…
## $ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, …
## $ city                                         <chr> "Southern Suburbs", "Cap…
## $ state                                        <chr> "Western Cape", "Western…
## $ zipcode                                      <dbl> 7700, 7441, 7441, 7441, …
## $ market                                       <chr> "Cape Town", "Cape Town"…
## $ smart_location                               <chr> "Southern Suburbs, South…
## $ country_code                                 <chr> "ZA", "ZA", "ZA", "ZA", …
## $ country                                      <chr> "South Africa", "South A…
## $ latitude                                     <dbl> -33.9, -33.8, -33.8, -33…
## $ longitude                                    <dbl> 18.5, 18.5, 18.5, 18.5, …
## $ is_location_exact                            <lgl> TRUE, TRUE, FALSE, TRUE,…
## $ property_type                                <chr> "Guesthouse", "House", "…
## $ room_type                                    <chr> "Entire home/apt", "Enti…
## $ accommodates                                 <dbl> 2, 6, 6, 2, 2, 2, 4, 3, …
## $ bathrooms                                    <dbl> 1.0, 3.0, 2.0, 1.5, 1.0,…
## $ bedrooms                                     <dbl> 1, 3, 3, 1, 1, 1, 2, 1, …
## $ beds                                         <dbl> 1, 4, 5, 2, 1, 1, 4, 1, …
## $ bed_type                                     <chr> "Real Bed", "Real Bed", …
## $ amenities                                    <chr> "{TV,\"Cable TV\",Wifi,\…
## $ square_feet                                  <dbl> NA, NA, NA, NA, NA, NA, …
## $ price                                        <chr> "$815.00", "$2,038.00", …
## $ weekly_price                                 <chr> "$5,204.00", "$9,000.00"…
## $ monthly_price                                <chr> "$20,816.00", "$40,000.0…
## $ security_deposit                             <chr> NA, "$2,500.00", "$3,000…
## $ cleaning_fee                                 <chr> "$260.00", "$850.00", "$…
## $ guests_included                              <dbl> 2, 6, 1, 2, 2, 3, 2, 1, …
## $ extra_people                                 <chr> "$0.00", "$250.00", "$0.…
## $ minimum_nights                               <dbl> 3, 2, 4, 5, 2, 14, 2, 2,…
## $ maximum_nights                               <dbl> 730, 120, 730, 1125, 730…
## $ minimum_minimum_nights                       <dbl> 3, 2, 4, 5, 2, 14, 2, 2,…
## $ maximum_minimum_nights                       <dbl> 3, 2, 4, 5, 2, 14, 2, 2,…
## $ minimum_maximum_nights                       <dbl> 730, 120, 730, 1125, 730…
## $ maximum_maximum_nights                       <dbl> 730, 120, 730, 1125, 730…
## $ minimum_nights_avg_ntm                       <dbl> 3.0, 2.0, 4.0, 5.0, 2.0,…
## $ maximum_nights_avg_ntm                       <dbl> 730, 120, 730, 1125, 730…
## $ calendar_updated                             <chr> "7 months ago", "4 month…
## $ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ availability_30                              <dbl> 30, 0, 30, 0, 30, 30, 0,…
## $ availability_60                              <dbl> 60, 0, 60, 0, 60, 60, 0,…
## $ availability_90                              <dbl> 90, 0, 90, 0, 90, 90, 0,…
## $ availability_365                             <dbl> 365, 0, 365, 234, 365, 3…
## $ calendar_last_scraped                        <date> 2020-06-23, 2020-06-24,…
## $ number_of_reviews                            <dbl> 54, 27, 0, 5, 0, 2, 8, 3…
## $ number_of_reviews_ltm                        <dbl> 6, 1, 0, 0, 0, 0, 1, 0, …
## $ first_review                                 <date> 2013-05-31, 2013-12-15,…
## $ last_review                                  <date> 2020-03-07, 2019-08-18,…
## $ review_scores_rating                         <dbl> 97, 97, NA, 100, NA, 90,…
## $ review_scores_accuracy                       <dbl> 10, 10, NA, 10, NA, 8, 1…
## $ review_scores_cleanliness                    <dbl> 10, 10, NA, 10, NA, 8, 1…
## $ review_scores_checkin                        <dbl> 10, 10, NA, 10, NA, 9, 1…
## $ review_scores_communication                  <dbl> 10, 10, NA, 10, NA, 10, …
## $ review_scores_location                       <dbl> 10, 10, NA, 10, NA, 10, …
## $ review_scores_value                          <dbl> 10, 10, NA, 10, NA, 8, 1…
## $ requires_license                             <lgl> FALSE, FALSE, FALSE, FAL…
## $ license                                      <dbl> NA, NA, NA, NA, NA, NA, …
## $ jurisdiction_names                           <lgl> NA, NA, NA, NA, NA, NA, …
## $ instant_bookable                             <lgl> TRUE, FALSE, FALSE, FALS…
## $ is_business_travel_ready                     <lgl> FALSE, FALSE, FALSE, FAL…
## $ cancellation_policy                          <chr> "moderate", "strict_14_w…
## $ require_guest_profile_picture                <lgl> FALSE, FALSE, FALSE, FAL…
## $ require_guest_phone_verification             <lgl> FALSE, FALSE, FALSE, FAL…
## $ calculated_host_listings_count               <dbl> 1, 4, 10, 5, 1, 1, 3, 1,…
## $ calculated_host_listings_count_entire_homes  <dbl> 1, 4, 10, 1, 1, 1, 2, 1,…
## $ calculated_host_listings_count_private_rooms <dbl> 0, 0, 0, 4, 0, 0, 0, 0, …
## $ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reviews_per_month                            <dbl> 0.63, 0.34, NA, 0.06, NA…

We find that the dataset has 24,602 observations (rows) with 106 variables (columns).

When looking at the data set as a whole, we can see that there are many different data types, such as characters chr, numeric doubles dbl, logicals lgl and dates date. Nevertheless, not all variables seem to be stored in the correct variable type. Let’s have a closer look.

We find the following 37 variables correctly stored as doubles:

We found the following 3 variables stored as doubles but believe they would be better stored as characters as they function more like names rather than numbers.

As these variables will however not be relevant for our further anaylsis, we will leave them untouched when cleaning the data set.

We did, however, find the following 8 variables that are wrongly stored and must be stored as numeric variables:

We will change these data types by cleaning the data.

As price is currently a character instead of a numeric variable, we want to change its variable type in the following. To do so, we will first use readr::parse_number() which drops any non-numeric characters before or after the first number.

While we are unsure about the currency, we will assume it’s in USD and not the local currency Rand.

listings <- listings %>% 
  mutate(price = parse_number(price))

#confirm that price is now stored as a number
typeof(listings$price)
## [1] "double"

Next, we change the rest of the variables listed above to numeric. We will keep in mind that the currency of all those is in dollars as we will eliminate the character “$” from the value.

#change all to numerics

listings <- listings %>% 
  mutate(weekly_price = parse_number(weekly_price))

listings <- listings %>% 
  mutate(monthly_price = parse_number(monthly_price))

listings <- listings %>% 
  mutate(security_deposit = parse_number(security_deposit))

listings <- listings %>% 
  mutate(cleaning_fee = parse_number(cleaning_fee))

listings <- listings %>% 
  mutate(extra_people = parse_number(extra_people))

#check all
typeof(listings$weekly_price)
## [1] "double"
typeof(listings$monthly_price)
## [1] "double"
typeof(listings$security_deposit)
## [1] "double"
typeof(listings$cleaning_fee)
## [1] "double"
typeof(listings$extra_people)
## [1] "double"

Lastly, we change the rates into numerics, keeping in mind that they are in percentages as we will eliminate the “%” sign from the value.

#change all to numerics

listings <- listings %>% 
  mutate(host_response_rate = parse_number(host_response_rate))

listings <- listings %>% 
  mutate(host_acceptance_rate = parse_number(host_acceptance_rate))

#check all
typeof(listings$host_response_rate)
## [1] "double"
typeof(listings$host_acceptance_rate)
## [1] "double"

While there are multiple variables in the dataset, here is a quick description of some of the relevant variables collected, with cost data typically expressed in US$. We will focus our analysis on these.

Summary Statistics

Skim

In order to compute summary statistics of our variables and to find NAs, we will now run both the favstats() and the skim() function.

skimr::skim(listings)
Table 1: Data summary
Name listings
Number of rows 24062
Number of columns 106
_______________________
Column type frequency:
character 35
Date 5
logical 17
numeric 49
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 24062 0
name 0 1.00 1 255 0 23472 0
summary 788 0.97 1 1000 0 21864 0
space 6737 0.72 1 1000 0 16201 0
description 537 0.98 1 1000 0 22699 0
experiences_offered 0 1.00 4 4 0 1 0
neighborhood_overview 7661 0.68 1 1000 0 14099 0
notes 13302 0.45 1 1000 0 9081 0
transit 8467 0.65 1 1000 0 13469 0
access 10271 0.57 1 1000 0 11566 0
interaction 9120 0.62 1 1000 0 12194 0
house_rules 12005 0.50 1 1000 0 10309 0
picture_url 0 1.00 80 146 0 23716 0
host_url 0 1.00 38 43 0 14826 0
host_name 2 1.00 1 35 0 5400 0
host_location 125 0.99 2 254 0 610 0
host_about 11128 0.54 1 3208 0 7090 18
host_response_time 2 1.00 3 18 0 5 0
host_thumbnail_url 2 1.00 55 106 0 14761 0
host_picture_url 2 1.00 57 109 0 14761 0
host_verifications 0 1.00 2 161 0 315 0
street 0 1.00 16 67 0 441 0
neighbourhood_cleansed 0 1.00 6 8 0 99 0
city 54 1.00 1 39 0 259 0
state 401 0.98 2 26 0 56 0
market 30 1.00 4 21 0 4 0
smart_location 0 1.00 12 53 0 293 0
country_code 0 1.00 2 2 0 2 0
country 0 1.00 5 12 0 2 0
property_type 0 1.00 3 22 0 42 0
room_type 0 1.00 10 15 0 4 0
bed_type 0 1.00 5 13 0 5 0
amenities 0 1.00 2 1672 0 22932 0
calendar_updated 0 1.00 5 13 0 73 0
cancellation_policy 0 1.00 6 27 0 9 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2020-06-21 2020-06-25 2020-06-23 5
host_since 2 1.00 2008-08-11 2020-06-17 2016-03-12 2999
calendar_last_scraped 0 1.00 2020-06-21 2020-06-25 2020-06-23 5
first_review 6883 0.71 2010-06-15 2020-06-22 2018-03-31 2226
last_review 6883 0.71 2013-04-19 2020-06-22 2020-02-04 1407

Variable type: logical

skim_variable n_missing complete_rate mean count
thumbnail_url 24062 0 NaN :
medium_url 24062 0 NaN :
xl_picture_url 24062 0 NaN :
host_is_superhost 2 1 0.23 FAL: 18493, TRU: 5567
host_neighbourhood 24062 0 NaN :
host_has_profile_pic 2 1 1.00 TRU: 23984, FAL: 76
host_identity_verified 2 1 0.26 FAL: 17792, TRU: 6268
neighbourhood 24062 0 NaN :
neighbourhood_group_cleansed 24062 0 NaN :
is_location_exact 0 1 0.79 TRU: 19060, FAL: 5002
has_availability 0 1 1.00 TRU: 24062
requires_license 0 1 0.00 FAL: 24062
jurisdiction_names 24062 0 NaN :
instant_bookable 0 1 0.52 TRU: 12426, FAL: 11636
is_business_travel_ready 0 1 0.00 FAL: 24062
require_guest_profile_picture 0 1 0.00 FAL: 24009, TRU: 53
require_guest_phone_verification 0 1 0.00 FAL: 23977, TRU: 85

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.42e+07 1.22e+07 3.19e+03 1.46e+07 2.30e+07 3.53e+07 4.39e+07 ▃▇▆▆▇
scrape_id 0 1.00 2.02e+13 0.00e+00 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ▁▁▇▁▁
host_id 0 1.00 9.93e+07 9.15e+07 1.69e+03 2.63e+07 6.24e+07 1.55e+08 3.50e+08 ▇▃▂▂▁
host_response_rate 13639 0.43 9.25e+01 2.13e+01 0.00e+00 1.00e+02 1.00e+02 1.00e+02 1.00e+02 ▁▁▁▁▇
host_acceptance_rate 5234 0.78 8.22e+01 2.81e+01 0.00e+00 7.50e+01 9.70e+01 1.00e+02 1.00e+02 ▁▁▁▁▇
host_listings_count 2 1.00 1.86e+01 1.36e+02 0.00e+00 1.00e+00 2.00e+00 5.00e+00 3.33e+03 ▇▁▁▁▁
host_total_listings_count 2 1.00 1.86e+01 1.36e+02 0.00e+00 1.00e+00 2.00e+00 5.00e+00 3.33e+03 ▇▁▁▁▁
zipcode 661 0.97 7.79e+03 9.96e+02 0.00e+00 7.53e+03 7.92e+03 8.00e+03 8.01e+04 ▇▁▁▁▁
latitude 0 1.00 -3.40e+01 1.00e-01 -3.43e+01 -3.40e+01 -3.39e+01 -3.39e+01 -3.35e+01 ▂▃▇▁▁
longitude 0 1.00 1.85e+01 1.30e-01 1.83e+01 1.84e+01 1.84e+01 1.85e+01 1.89e+01 ▇▅▁▁▁
accommodates 0 1.00 3.89e+00 2.46e+00 1.00e+00 2.00e+00 4.00e+00 5.00e+00 4.00e+01 ▇▁▁▁▁
bathrooms 15 1.00 1.74e+00 1.23e+00 0.00e+00 1.00e+00 1.00e+00 2.00e+00 5.00e+01 ▇▁▁▁▁
bedrooms 27 1.00 1.88e+00 1.43e+00 0.00e+00 1.00e+00 1.00e+00 3.00e+00 5.00e+01 ▇▁▁▁▁
beds 107 1.00 2.40e+00 2.00e+00 0.00e+00 1.00e+00 2.00e+00 3.00e+00 5.00e+01 ▇▁▁▁▁
square_feet 23995 0.00 9.27e+02 1.32e+03 0.00e+00 2.60e+02 5.92e+02 1.04e+03 6.46e+03 ▇▁▁▁▁
price 0 1.00 2.47e+03 7.53e+03 0.00e+00 6.59e+02 1.06e+03 2.00e+03 3.00e+05 ▇▁▁▁▁
weekly_price 22826 0.05 1.98e+04 7.25e+04 6.20e+01 3.57e+03 6.26e+03 1.20e+04 1.30e+06 ▇▁▁▁▁
monthly_price 22987 0.04 5.98e+04 2.15e+05 2.52e+02 1.14e+04 1.92e+04 3.48e+04 2.77e+06 ▇▁▁▁▁
security_deposit 8559 0.64 2.77e+03 6.32e+03 0.00e+00 0.00e+00 1.48e+03 3.00e+03 1.50e+05 ▇▁▁▁▁
cleaning_fee 7049 0.71 3.38e+02 5.12e+02 0.00e+00 1.00e+02 3.00e+02 4.50e+02 4.00e+04 ▇▁▁▁▁
guests_included 0 1.00 1.62e+00 1.44e+00 1.00e+00 1.00e+00 1.00e+00 2.00e+00 2.40e+01 ▇▁▁▁▁
extra_people 0 1.00 8.38e+01 2.20e+02 0.00e+00 0.00e+00 0.00e+00 9.65e+01 4.68e+03 ▇▁▁▁▁
minimum_nights 0 1.00 4.35e+00 1.44e+01 1.00e+00 1.00e+00 2.00e+00 4.00e+00 1.12e+03 ▇▁▁▁▁
maximum_nights 0 1.00 1.13e+03 6.45e+04 1.00e+00 6.00e+01 1.12e+03 1.12e+03 1.00e+07 ▇▁▁▁▁
minimum_minimum_nights 0 1.00 4.06e+00 1.39e+01 1.00e+00 1.00e+00 2.00e+00 3.00e+00 1.12e+03 ▇▁▁▁▁
maximum_minimum_nights 0 1.00 5.03e+00 1.51e+01 1.00e+00 2.00e+00 2.00e+00 5.00e+00 1.12e+03 ▇▁▁▁▁
minimum_maximum_nights 0 1.00 1.24e+03 6.45e+04 1.00e+00 1.60e+02 1.12e+03 1.12e+03 1.00e+07 ▇▁▁▁▁
maximum_maximum_nights 0 1.00 1.25e+03 6.45e+04 1.00e+00 1.80e+02 1.12e+03 1.12e+03 1.00e+07 ▇▁▁▁▁
minimum_nights_avg_ntm 0 1.00 4.36e+00 1.44e+01 1.00e+00 1.30e+00 2.00e+00 4.00e+00 1.12e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 0 1.00 1.25e+03 6.45e+04 1.00e+00 1.80e+02 1.12e+03 1.12e+03 1.00e+07 ▇▁▁▁▁
availability_30 0 1.00 1.78e+01 1.35e+01 0.00e+00 0.00e+00 2.70e+01 3.00e+01 3.00e+01 ▆▁▁▁▇
availability_60 0 1.00 3.71e+01 2.67e+01 0.00e+00 0.00e+00 5.50e+01 5.90e+01 6.00e+01 ▅▁▁▁▇
availability_90 0 1.00 5.71e+01 3.96e+01 0.00e+00 0.00e+00 8.30e+01 8.90e+01 9.00e+01 ▅▁▁▁▇
availability_365 0 1.00 2.03e+02 1.42e+02 0.00e+00 6.10e+01 2.06e+02 3.57e+02 3.65e+02 ▅▂▃▂▇
number_of_reviews 0 1.00 1.44e+01 2.94e+01 0.00e+00 0.00e+00 3.00e+00 1.40e+01 4.25e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 4.51e+00 8.67e+00 0.00e+00 0.00e+00 1.00e+00 5.00e+00 2.27e+02 ▇▁▁▁▁
review_scores_rating 7314 0.70 9.46e+01 8.87e+00 2.00e+01 9.30e+01 9.70e+01 1.00e+02 1.00e+02 ▁▁▁▁▇
review_scores_accuracy 7334 0.70 9.63e+00 9.00e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_cleanliness 7329 0.70 9.54e+00 9.40e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_checkin 7336 0.70 9.75e+00 7.80e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_communication 7332 0.70 9.75e+00 7.90e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_location 7336 0.70 9.72e+00 7.30e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_value 7341 0.69 9.50e+00 9.30e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
license 24037 0.00 4.36e+09 3.10e+08 4.01e+09 4.03e+09 4.32e+09 4.66e+09 4.74e+09 ▇▁▁▁▇
calculated_host_listings_count 0 1.00 1.29e+01 5.04e+01 1.00e+00 1.00e+00 2.00e+00 4.00e+00 3.86e+02 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 1.16e+01 4.86e+01 0.00e+00 1.00e+00 1.00e+00 2.00e+00 3.72e+02 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 8.40e-01 2.06e+00 0.00e+00 0.00e+00 0.00e+00 1.00e+00 2.40e+01 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 3.00e-02 4.40e-01 0.00e+00 0.00e+00 0.00e+00 0.00e+00 1.20e+01 ▇▁▁▁▁
reviews_per_month 6883 0.71 7.20e-01 8.90e-01 1.00e-02 1.40e-01 3.60e-01 9.80e-01 8.64e+00 ▇▁▁▁▁

We find that the original data set includes: - 43 character variables - 41 numeric variables - 17 logical variables - 5 date variables

(before data cleaning done in previous section)

After the first data cleaning in the section above, we find:

  • 35 character variables
  • 49 numeric variables
  • 17 logical variables
  • 5 date variables

MISSING VALUES

This is how many missing values we find within our relevant variables:

  • price: 0 missing

  • extra_people: 0 missing

  • property_type: 0 missing

  • room_type: 0 missing

  • number_of_reviews: 0 missing

  • longitude , latitude: both 0 missing

  • neighbourhood_cleansed: 0 missing

  • cleaning_fee: 7,049 missing

  • review_scores_rating: 7,314 missing

Note: We will go with price as relevant price variable as it indicates the price per night and the other price variables (weekly_price and monthly_price) have several missing values.

As a missing value for cleaning_fee highly likely indicates, that there is no cleaning fee, we will convert all NAs to 0.

We already changed the cleaning_fee data type to numeric in the previous section and will now substitute all NAs with 0s as described.

#replace all NAs with 0
listings <- listings %>%
  mutate(cleaning_fee = case_when(
    is.na(cleaning_fee) ~ 0, 
    TRUE ~ cleaning_fee
  ))

#skimr::skim(listings)
# find no more missing values for cleaning_fee

In order to still capture the observations that have a missing review_scores_rating but don’t skew the average scores rating, we will calculate the mean of all scores (except NAs) and then convert all NAs into that mean score.

#calculate the mean of all scores except NAs
avg_review_scores_rating <- listings %>% 
  # pull the review_scores_rating as a list
  pull(review_scores_rating) %>% 
  mean(na.rm = TRUE)

#convert NAs to the mean score
listings <- listings %>% 
  mutate(review_scores_rating = case_when(
    is.na(review_scores_rating) ~ avg_review_scores_rating,
    TRUE ~ review_scores_rating
  ))

#skimr::skim(listings)
# find no more missing values for review_scores_rating

Note: Variables with 100% missing values will be eliminated later)

PROPERTY TYPE

Looking at property_type, we can use the count function to determine how many categories there are their frequency.

#count property types
listings %>% 
  count(property_type) %>% 
  arrange(desc(n)) #arrange in descending order of count
property_typen
Apartment9670
House7030
Guest suite1488
Villa1051
Guesthouse994
Bed and breakfast613
Condominium590
Townhouse582
Serviced apartment465
Cottage449
Loft317
Boutique hotel166
Bungalow139
Other99
Hostel92
Farm stay56
Chalet48
Tiny house34
Cabin31
Aparthotel30
Hotel28
Nature lodge14
Camper/RV12
Earth house12
Dome house8
Resort7
Boat6
Dorm5
Hut4
Tipi4
Barn3
Casa particular (Cuba)2
Castle2
Lighthouse2
Treehouse2
Bus1
Cave1
Heritage hotel (India)1
Island1
Tent1
Vacation home1
Yurt1

We find that the top 4 property types are “Apartment” with 9,670 listings, “House” with 7,030 listings, “Guest suite” with 1,488 listings and “Villa” with 1,051 listings. Together these top 4 property types concern 19,239 listings, meaning they account for 80% (19,239 / 24,062) of all listings and are, therefore, most relevant for our analysis.

Since the vast majority of the observations in the data are one of the top four property types, we would like to create a simplified version of property_type variable that has 5 categories: the top four categories and Other.

#mutate to 5 types
listings <- listings %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Apartment","House", "Guest suite","Villa") ~ property_type, 
    TRUE ~ "Other"
  ))
 
#check prop_type_simplified compared to property_type
listings %>%
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n))      
property_typeprop_type_simplifiedn
ApartmentApartment9670
HouseHouse7030
Guest suiteGuest suite1488
VillaVilla1051
GuesthouseOther994
Bed and breakfastOther613
CondominiumOther590
TownhouseOther582
Serviced apartmentOther465
CottageOther449
LoftOther317
Boutique hotelOther166
BungalowOther139
OtherOther99
HostelOther92
Farm stayOther56
ChaletOther48
Tiny houseOther34
CabinOther31
AparthotelOther30
HotelOther28
Nature lodgeOther14
Camper/RVOther12
Earth houseOther12
Dome houseOther8
ResortOther7
BoatOther6
DormOther5
HutOther4
TipiOther4
BarnOther3
Casa particular (Cuba)Other2
CastleOther2
LighthouseOther2
TreehouseOther2
BusOther1
CaveOther1
Heritage hotel (India)Other1
IslandOther1
TentOther1
Vacation homeOther1
YurtOther1
#check only prop_type_simplified
listings %>%
  count(prop_type_simplified) %>%
  arrange(desc(n))   
prop_type_simplifiedn
Apartment9670
House7030
Other4823
Guest suite1488
Villa1051

MINIMUM NIGHTS

As Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels, we only want to include listings in our regression analysis that are intended for travel purposes.

#count values for minimum_nights
listings %>% 
  count(minimum_nights) %>% 
  arrange(desc(n)) #arrange in descending order of count to find most common values
minimum_nightsn
2       7436
1       6894
3       3527
7       1614
5       1489
4       1098
10       508
14       337
20       252
6       244
30       182
8       62
90       59
28       50
12       47
15       36
21       35
60       33
180       27
9       17
13       12
31       11
25       9
100       8
120       7
29       6
365       6
11       5
18       3
27       3
45       3
93       3
150       3
360       3
16       2
19       2
22       2
24       2
32       2
95       2
168       2
183       2
185       2
210       2
26       1
33       1
40       1
42       1
89       1
91       1
94       1
124       1
182       1
186       1
240       1
300       1
1.12e+031

The top 3 most common minimum days of stay required by the host are 2 nights with 7,436 observations, 1 night with 6,894 observations and 3 nights with 3,527 observations. Number 4 is 7 nights minimum stay with 1,614 observations.

This clearly shows that hosts are either fine with their guests doing a short-stay around 1-3 days or want them to stay at least a week at 7 days minimum stay.

Ranking place number 7, 8 and 9 are 10, 14 and 20 days respectively. With all this information, it is easy to see that hosts are either okay with short-term stays with frequently new guests and also frequently cleaning and tidying the space, whereas other hosts only want guests for long-term stays (e.g. 1-2 weeks). This is most probably the reason because only long-term stays are profitable for these hosts or they don’t want to tidy the space every second.

To analyse the data, we filter the data set so that it only includes observations with minimum_nights <= 4

#filter minimum nights <= 4
listings <- listings %>% 
  filter(minimum_nights <= 4)

#check
listings %>% 
  count(minimum_nights) %>% 
  arrange(desc(n)) 
minimum_nightsn
27436
16894
33527
41098

NEIGHBOURHOODS

We also find that there are several different neighbourhood variables, namely:

  • neigborhood_overview: gives a description
  • neighbourhood: all NAs
  • neighbourhood_cleansed: indicates which ward listing is in
  • neighbourhood_group_cleansed: all NAs

We will eliminate the the two variables that are just NAs and will try to summarize the neighbourhood_cleansed variable into only a few city areas instead of so many wards. Since Cape Town has 116 wards, we decide to split them into 24 sub councils, which deal with local functions for between three and six wards.

listings <- listings %>% 
#creating a new variable to take down sub council names
  mutate(sub_council = case_when(
    neighbourhood_cleansed %in% c("Ward 23","Ward 29", "Ward 32","Ward 104") ~ 1,
    neighbourhood_cleansed %in% c("Ward 6","Ward 7", "Ward 8","Ward 101", "Ward 102", "Ward 111") ~ 2,
    neighbourhood_cleansed %in% c("Ward 1","Ward 4", "Ward 5","Ward 70", "Ward 107", "Ward 113") ~ 3,
    neighbourhood_cleansed %in% c("Ward 25","Ward 26", "Ward 27","Ward 28", "Ward 30") ~ 4,
    neighbourhood_cleansed %in% c("Ward 13","Ward 20", "Ward 24","Ward 31", "Ward 50", "Ward 106") ~ 5,
    neighbourhood_cleansed %in% c("Ward 2","Ward 3", "Ward 9","Ward 10", "Ward 12", "Ward 22") ~ 6,
    neighbourhood_cleansed %in% c("Ward 21","Ward 103", "Ward 105","Ward 112") ~ 7,
    neighbourhood_cleansed %in% c("Ward 83","Ward 85", "Ward 86","Ward 100") ~ 8,
    neighbourhood_cleansed %in% c("Ward 18","Ward 87", "Ward 89","Ward 90", "Ward 91", "Ward 116") ~ 9,
    neighbourhood_cleansed %in% c("Ward 92","Ward 93", "Ward 94","Ward 97", "Ward 98", "Ward 99") ~ 10,
    neighbourhood_cleansed %in% c("Ward 40","Ward 44", "Ward 46","Ward 47") ~ 11,
    neighbourhood_cleansed %in% c("Ward 78","Ward 79", "Ward 81","Ward 82") ~ 12,
    neighbourhood_cleansed %in% c("Ward 34","Ward 35", "Ward 36","Ward 80", "Ward 88") ~ 13,
    neighbourhood_cleansed %in% c("Ward 37","Ward 38", "Ward 39","Ward 41", "Ward 42", "Ward 45") ~ 14,
    neighbourhood_cleansed %in% c("Ward 51","Ward 52", "Ward 53","Ward 55", "Ward 56") ~ 15,
    neighbourhood_cleansed %in% c("Ward 54","Ward 57", "Ward 74","Ward 77", "Ward 115") ~ 16,
    neighbourhood_cleansed %in% c("Ward 48","Ward 49", "Ward 60") ~ 17,
    neighbourhood_cleansed %in% c("Ward 63","Ward 65", "Ward 66","Ward 67", "Ward 68", "Ward 110") ~ 18,
    neighbourhood_cleansed %in% c("Ward 61","Ward 64", "Ward 69") ~ 19,
    neighbourhood_cleansed %in% c("Ward 58","Ward 59", "Ward 62","Ward 71", "Ward 72", "Ward 73") ~ 20,
    neighbourhood_cleansed %in% c("Ward 11","Ward 19", "Ward 108") ~ 21,
    neighbourhood_cleansed %in% c("Ward 14","Ward 16", "Ward 17","Ward 114") ~ 22,
    neighbourhood_cleansed %in% c("Ward 33","Ward 43", "Ward 75","Ward 76") ~ 23,
    TRUE ~ 24
  )) 

After consulting Cape Town experts in our group, we found that the 24 sub councils can be further categorised into 6 neighbourhood areas, which namely are Atlantic seaboard, CBD, Outer suburbs, Southern Suburbs, Cape Flats and Sommerset.

Atlantic seaboard- this term is often used by locals to describe the residential/ commercial areas facing the Atlantic seaboard and in front of table mountain, region is often characterized by luxury apartments and homes and offers many tourist attractions such a views from table mountain and beachfront restaurants

CBD-this region is generally considered the city centre and its surrounding areas characterized as the business/ commercial district of cape town.

Outer suburbs- these are regions generally away from the CBD and seaside, land mainly used for industrial purposes.

Southern Suburbs-Cape Town’s Southern Suburbs lie to the Southeast of the slopes of Table Mountain within rich valleys and vast plains reaching from just south of the Table Bay industrial neighbourhoods in the north to the False Bay coastal suburbs and the Cape Peninsula cliffs to the south, and are crossed North-South by the M3 and M5 freeways. In general, this area is identified as being the more affluent of the Cape Town Metropolis’ sections and includes the city’s most expensive residential neighbourhoods.

Cape Flats- this regions are generally characterized by their townships, this includes Mitchells Plainand and Khayelitsha and guguletho some of the regions largest townships, characterized by being very low income and run down.

Sommerset- Sommerset west is one of the largest coastal towns outside of the CBD, and offers both residential and tourist appeal away from the hussle of the city centre.

#change `sub_council` to character form since it represents a name tag
listings$sub_council <- as.character(listings$sub_council)
#summarize all wards from neighbourhood_cleansed to 6 neighbourhood areas
listings <- listings %>%
  mutate(neighbourhood_simplified = case_when(
    sub_council %in% c(1, 2, 7, 21) ~ "Outer Suburbs", 
    sub_council %in% c(3, 4, 5, 6) ~ "CBD", 
    sub_council %in% c(8, 22) ~ "Sommerset", 
    sub_council %in% c(9, 10, 12, 24) ~ "Cape Flats", 
    sub_council %in% c(11, 13, 14, 17, 18, 23) ~ "Southern Suburbs", 
    TRUE ~ "Atlantic Seaboard"
  ))

#check neighbourhood_simplified compared to sub_council
listings %>%
  count(neighbourhood_simplified, sub_council) %>%
  arrange(desc(n))   

neighbourhood_simplifiedsub_counciln
Atlantic Seaboard168782
Atlantic Seaboard192065
Atlantic Seaboard201848
CBD31554
Outer Suburbs11050
Atlantic Seaboard15788
Cape Flats24719
Sommerset8670
Outer Suburbs7477
CBD6279
Outer Suburbs2169
Southern Suburbs17155
Southern Suburbs18124
CBD466
Outer Suburbs2166
Sommerset2245
Southern Suburbs1135
Southern Suburbs2320
Cape Flats1018
Cape Flats129
Cape Flats99
CBD54
Southern Suburbs143
LOCATION

Further, we find these variables regarding location:

  • 441 unique values for street
  • 259 unique values for city
  • 56 unique values for state
  • 4 unique values for market
  • 2 unique values for country_code
  • 2 unique values for country

As several values for market, country_code and country seem odd, we inspect those values further in the dataset.

We find that market includes the following values:

  • “Cape Town”
  • “Other International”
  • “Krabi”
  • “D.C.”

However, they all have “Cape Town” as values for the smart location which is why we can either disregard this column or we assume the names are just wrong so that we can change the three odd names to “Cape Town” also.

#change "Other Internatinal", "Krabi", "D.C." to "Cape Town"
listings <- listings %>% 
    mutate(market = case_when(
    market == "Cape Town" ~ "Cape Town",
    TRUE ~ "Cape Town"
  ))

#check
listings %>% 
  count(market) 
marketn
Cape Town18955

When further investigating country_code and country, we find that all observations have the value “ZA” or “South Africa” but one observation which has the values “CG” and “Congo” respectively. As this observation’s smart_location is also “Congo”, we eliminate this observation as a wrongly assigned value.

#filter the observations to ensure they contain "ZA" for 'country_code'
listings <- listings %>% 
    filter(country_code != "CG")

#check
listings %>% 
  count(country_code) 
country_coden
ZA18954

We find that 6 values within state are called “.Western Cape” instead of “Western Cape” which we will change in the data cleansing as well. @SR

#change ".Western Cape" to "Western Cape"
listings$state[listings$state == ".Western Cape"] <- "Western Cape"
listings$state[listings$state == "Cape town"] <- "Cape Town"
listings$state[listings$state == "Cape Town - Western Cape"] <- "Cape Town"
listings$state[listings$state == "Cape Town Western Cape"] <- "Cape Town"
#check
listings %>% 
  count(state) 
staten
Bellville1
Blouberg1
Camps Bay2
Cape Province1
Cape Town82
Cape Town,1
CBD2
Central Bussiness District2
De Waterkant1
Gardens2
Gauteng1
Goodwood1
Gordons Bay4
Green Point1
Hout bay1
Hout Bay1
Muizenberg2
Parklands1
Rondebosch1
SA2
Simon's Town in Cape Town1
Simons Town1
Somerset West1
South africa1
South Africa21
Tarifa1
Wc1
WC1095
WC Hout Bay1
We1
Wes-Kaap43
western cape3
Western cape6
Western Cape17274
Western Cape Town1
Western Cape, Cape Town1
Western Cape, South Africa1
Western Cape/ Rondebosch1
Western Peovince6
Western province1
Western Province11
Westkap1
Wine Lands South Africa1
Woodstock6
364

We further find that host_location has several unique values including - “Cape Town, Western Cape, South Africa” - “ZA” - “Parow, Western Cape, South Africa”

listings %>% 
  count(host_location) %>% 
  arrange(desc(n))  

host_locationn
Cape Town, Western Cape, South Africa13442
ZA2952
South Africa367
Western Cape, South Africa325
Johannesburg, Gauteng, South Africa193
93
Stellenbosch, Western Cape, South Africa70
London, England, United Kingdom69
GB67
Pretoria, Gauteng, South Africa62
Durban, KwaZulu-Natal, South Africa51
US49
Sandton, Gauteng, South Africa48
United Kingdom41
Hamburg, Hamburg, Germany36
KwaZulu-Natal, South Africa32
England, United Kingdom29
Berlin, Berlin, Germany23
DE18
Paarl, Western Cape, South Africa18
Gauteng, South Africa17
Amsterdam, North Holland, Netherlands15
Germany15
Knysna, Western Cape, South Africa14
Dubai, Dubai, United Arab Emirates13
Hong Kong13
NL12
Randburg, Gauteng, South Africa12
London, United Kingdom11
CA9
Ibhayi, Eastern Cape, South Africa9
Port Elizabeth, Eastern Cape, South Africa9
Brussels, Brussels, Belgium8
Cologne, North Rhine-Westphalia, Germany8
Netherlands8
Paris, Île-de-France, France8
Welkom, Free State, South Africa8
ES7
Frankfurt, Hesse, Germany7
Franschhoek, Western Cape, South Africa7
IL7
New York, New York, United States7
Randfontein, Gauteng, South Africa7
San Francisco, California, United States7
Bern, Canton of Bern, Switzerland6
CN6
Copenhagen, Denmark6
London6
Los Angeles, California, United States6
Vienna, Vienna, Austria6
Zürich, Zurich, Switzerland6
Dolphin Coast, KwaZulu-Natal, South Africa5
Edinburgh, Scotland, United Kingdom5
Liverpool, England, United Kingdom5
Robertson, Western Cape, South Africa5
Schwabach, Bavaria, Germany5
Stockholm, Stockholm County, Sweden5
United States5
Barcelona, Catalonia, Spain4
BE4
Cape Town, South Africa4
East London, Eastern Cape, South Africa4
Edenvale, Gauteng, South Africa4
FR4
George, Western Cape, South Africa4
Germiston, Gauteng, South Africa4
Hermanus, Western Cape, South Africa4
Hillcrest, KwaZulu-Natal, South Africa4
Kimberley, Northern Cape, South Africa4
Langebaan, Western Cape, South Africa4
Merchtem, Flanders, Belgium4
Midrand, Gauteng, South Africa4
Munich, Bavaria, Germany4
NO4
originally from London now living the dream in Cape Town4
Oslo, Oslo, Norway4
Richards Bay, KwaZulu-Natal, South Africa4
SG4
Singapore4
Sun City, North West, South Africa4
Sydney, New South Wales, Australia4
Washington, District of Columbia, United States4
Wellington, Western Cape, South Africa4
Windhoek, Khomas Region, Namibia4
1b Rontree Avenue, Camps Bay CT SA 80053
AE3
Australia3
Colenso, KwaZulu-Natal, South Africa3
Despatch, Eastern Cape, South Africa3
Düsseldorf, North Rhine-Westphalia, Germany3
Fort Lauderdale, Florida, United States3
Grabouw, Western Cape, South Africa3
Grünwald, Bavaria, Germany3
Hanover, Lower Saxony, Germany3
Hostivice, Central Bohemian Region, Czechia3
Howick, KwaZulu-Natal, South Africa3
Jeffreys Bay, Eastern Cape, South Africa3
Johannesburg South, Gauteng, South Africa3
Lichtenburg, North West, South Africa3
Melbourne, Victoria, Australia3
Mudjimba, Queensland, Australia3
Plettenberg Bay, Western Cape, South Africa3
Seattle, Washington, United States3
Shelley, England, United Kingdom3
Simon's Town (False Bay Coast) South Africa3
Sir Lowry's Pass, Western Cape, South Africa3
Soweto, Gauteng, South Africa3
Surrey, United Kingdom3
Umhlanga, KwaZulu-Natal, South Africa3
Wenzhou, Zhejiang, China3
Windmeul, Western Cape, South Africa3
Worcester, Western Cape, South Africa3
ZW3
Abu Dhabi, Abu Dhabi, United Arab Emirates2
Alberton, Gauteng, South Africa2
Antwerp, Flanders, Belgium2
AR2
Arlington, Virginia, United States2
AT2
Auckland, Auckland, New Zealand2
Bad Wildungen, Hesse, Germany2
Bellingham, Washington, United States2
Betty's Bay, Western Cape, South Africa2
Between Cape Town and India2
Bracknell, England, United Kingdom2
Bredasdorp, Western Cape, South Africa2
CD2
Centurion, Gauteng, South Africa2
CH2
Claremont, California, United States2
De Aar, Northern Cape, South Africa2
Dortmund, North Rhine-Westphalia, Germany2
Dublin 6, County Dublin, Ireland2
Dublin, Dublin, Ireland2
Fish Hoek, CPT, South Africa 2
France2
Fuengirola, Andalucía, Spain2
Gauteng2
Gold Coast, Queensland, Australia2
Harare, Harare, Zimbabwe2
Hartbeespoort, North West, South Africa2
Hoofddorp, North Holland, Netherlands2
ID2
Kloof, KwaZulu-Natal, South Africa2
Lisbon, Lisbon, Portugal2
Lodi, Lombardy, Italy2
London, UK2
Lusaka, Lusaka, Zambia2
Manchester, England, United Kingdom2
Margate, KwaZulu-Natal, South Africa2
Minneapolis, Minnesota, United States2
Monaco-Ville, Monaco2
Morón, Ciego de Avila, Cuba2
Mouzens, Nouvelle-Aquitaine, France2
Nairobi, Nairobi County, Kenya2
Norwich, England, United Kingdom2
Nuremberg, Bavaria, Germany2
Oxford, England, United Kingdom2
Pandamatenga, North West, Botswana2
Perth, Western Australia, Australia2
Polokwane, Limpopo, South Africa2
Riebeeck Kasteel, Western Cape, South Africa2
RO2
Saint Francis Bay, Eastern Cape, South Africa2
Serramanna, Sardegna, Italy2
Sevenoaks, England, United Kingdom2
Seychelles2
South of France2
Still Bay, Western Cape, South Africa2
Tel Aviv-Yafo, Tel Aviv District, Israel2
Tokyo, Japan2
Toronto, Ontario, Canada2
UK2
Urmston, England, United Kingdom2
Vancouver, British Columbia, Canada2
Vanderbijlpark, Gauteng, South Africa2
Wessobrunn, Bavaria, Germany2
West Coast DC, Western Cape, South Africa2
White River, Mpumalanga, South Africa2
Whitefish, Montana, United States2
Williamsport, Pennsylvania, United States2
Windhoek, Khomas, Namibia2
Wollerau, Canton of Schwyz, Switzerland2
's-Gravenzande, South Holland, Netherlands1
305 Hillside Heights, Wessels Road, Greenpoint, 80051
Alsheim, Rhineland-Palatinate, Germany1
AM1
Arklow, County Wicklow, Ireland1
Assel, Grevenmacher, Luxembourg1
Atlantic Beach, Florida, United States1
AU1
Austin, Texas, United States1
Balearic Islands, Spain1
Baltimore, Maryland, United States1
Bangkok, Thailand1
Bath, England, United Kingdom1
Bay Lake, Florida, United States1
Beaufort West, Western Cape, South Africa1
Beauvechain, Walloon Region, Belgium1
Bedford, Eastern Cape, South Africa1
Belgrade, Serbia1
Bellville, Texas, United States1
Belmont, Massachusetts, United States1
Berkhamsted, England, United Kingdom1
Bethlehem, Maryland, United States1
Between Cape Town and Berlin1
Bilthoven, Utrecht, The Netherlands1
Blagnac, Occitanie, France1
Blantyre, Southern Region, Malawi1
Bloemendaal, North Holland, Netherlands1
Bloemfontein, Free State, South Africa1
Bordeaux, Aquitaine-Limousin-Poitou-Charentes, France1
Bowden, South Australia, Australia1
Breda, North Brabant, Netherlands1
Bremen, Bremen, Germany1
Brest, Brittany, France1
Brighton, England, United Kingdom1
British Columbia, Canada1
Brooklyn, New York, United States1
Caddington, England, United Kingdom1
Canmore, Alberta, Canada1
Cape Town1
Cape Town, Western Cape1
Caribbean1
Carne, County Wexford, Ireland1
Cascais, Lisbon, Portugal1
Cayman Islands1
Chapel Row, England, United Kingdom1
Charleston, South Carolina, United States1
Charlotte, North Carolina, United States1
Charlottesville, Virginia, United States1
Charlton Kings, England, United Kingdom1
Chaville, Île-de-France, France1
Chesterfield, England, United Kingdom1
Chiawa, Lusaka Province, Zambia1
Chilton, England, United Kingdom1
City of Johannesburg Metropolitan Municipality, Gauteng, South Africa1
Clanwilliam, Western Cape, South Africa1
Collonges-sous-Salève, Auvergne-Rhône-Alpes, France1
Colorado Springs, Colorado, United States1
Como, Lombardy, Italy1
Compiègne, Nord-Pas-de-Calais-Picardie, France1
Corcelles-près-Concise, Vaud, Switzerland1
County Dublin, Ireland1
County Wicklow, Ireland1
Culver City, California, United States1
Cumbria, United Kingdom1
Czechia1
Darling, Western Cape, South Africa1
De Steeg, Gelderland, The Netherlands1
Denmark1
Denver, Colorado, United States1
Desio, Lombardy, Italy1
Deutschland1
DK1
Doha, Doha, Qatar1
Dordrecht, Zuid-Holland, Netherlands1
Dortmund, Nordrhein-Westfalen, Germany1
Dundee, Scotland, United Kingdom1
Durban Metro, KwaZulu-Natal, South Africa1
Durban North, KwaZulu-Natal, South Africa1
Durban, South Africa1
Elimbah, Queensland, Australia1
Emmeloord, Flevoland, Netherlands1
Engelskirchen, North Rhine-Westphalia, Germany1
England and South Africa1
Exeter, England, United Kingdom1
Farnham, England, United Kingdom1
Fish Hoek1
Fish Hoek, South Africa1
Flensburg, Schleswig-Holstein, Germany1
Florence, Tuscany, Italy1
Fouesnant, Brittany, France1
Franklin, Tennessee, United States1
Freetown, Western Area, Sierra Leone1
Frome, England, United Kingdom1
Gaborone, South-East, Botswana1
Galway, Galway, Ireland1
Geneva, Canton of Geneva, Switzerland1
Geneva, Geneva, Switzerland1
Gilbert, Arizona, United States1
Ginosa, Puglia, Italy1
GR1
Grahamstown, Eastern Cape, South Africa1
Granada, Andalusia, Spain1
Greyton, Western Cape, South Africa1
Groot Brakrivier, Western Cape, South Africa1
Gütersloh, North Rhine-Westphalia, Germany1
Haarby, Denmark1
Halifax, Nova Scotia, Canada1
Harare, Harare Province, Zimbabwe1
Hatteras, North Carolina, United States1
Helsinki, Uusimaa, Finland1
Herolds Bay, Western Cape, South Africa1
Herrliberg, Canton of Zurich, Switzerland1
Heywood, England, United Kingdom1
Hilton, KwaZulu-Natal, South Africa1
HK1
Hong Kong, Hong Kong1
Hout Bay, RSA1
Houten, Utrecht, The Netherlands1
Hyattsville, Maryland, United States1
I am so fortunate to live in beautiful Cape Town with nature right on our doorstep.1
IE1
IN1
Innsbruck, Tyrol, Austria1
Ireland1
IS1
IT1
Jakarta, Indonesia1
Jersey1
Jerusalem, Jerusalem District, Israel1
Johannesburg, South Africa1
JP1
Kampala, Central Region, Uganda1
Karlskrona, Blekinge County, Sweden1
Karlsruhe, Baden-Württemberg, Germany1
Katlehong, Gauteng, South Africa1
Kelowna, British Columbia, Canada1
Kempton Park, Gauteng, South Africa1
Kiel, Schleswig-Holstein, Germany1
Kinshasa, Kinshasa, Democratic Republic of the Congo1
Kirchzarten, Baden-Württemberg, Germany1
Kitzbuhel, Tyrol, Austria1
Knokke-Heist, Flanders, Belgium1
Königstein im Taunus, Hesse, Germany1
Kreuzlingen, Thurgau, Switzerland1
Krugersdorp, Gauteng, South Africa1
Kuala Lumpur, Federal Territory of Kuala Lumpur, Malaysia1
Kuwait City, Al Asimah, Kuwait1
Lagos, Nigeria1
Langnau am Albis, Zurich, Switzerland1
Lausanne, Vaud, Switzerland1
Lenham, England, United Kingdom1
Leuven, Flanders, Belgium1
Liège, Walloon Region, Belgium1
Limerick, Limerick, Ireland1
Lisse, Zuid-Holland, Netherlands1
Lobamba, Hhohho Region, Eswatini1
London (West End, Leicester Square, Soho)1
London, England1
London, Ontario, Canada1
London; UK1
Londres, Grande-Bretagne1
Long Crendon, England, United Kingdom1
Loosdrecht, North Holland, Netherlands1
LS1
Lucerne, Lucerne, Switzerland1
Lünen, North Rhine-Westphalia, Germany1
Lusaka, Lusaka Province, Zambia1
Luxembourg1
Luxembourg City, Luxembourg, Luxembourg1
Madrid, Community of Madrid, Spain1
Magor, Wales, United Kingdom1
Mahikeng, North West, South Africa1
Málaga, Andalusia, Spain1
Malmesbury, Western Cape, South Africa1
Männedorf, Zurich, Switzerland1
Marburg, Hesse, Germany1
Marrakesh, Marrakesh-Tensift-El Haouz, Morocco1
Maseru, Maseru, Lesotho1
Melbourne, Florida, United States1
Melsonby, England, United Kingdom1
Mettmann, North Rhine-Westphalia, Germany1
Miami, Florida, United States1
Middelburg - MP, Mpumalanga, South Africa1
Milan, Lombardy, Italy1
Mill Valley, California, United States1
Millburn, New Jersey, United States1
Mombasa, Mombasa, Kenya1
Montagu, Western Cape, South Africa1
Mossel Bay, Western Cape, South Africa1
Mountain Line Farm, The Crags, South Africa1
Mountain View, California, United States1
Mpumalanga, South Africa1
Msunduzi Municipality, KwaZulu-Natal, South Africa1
Muizenberg, Western Cape, South Africa1
Nairobi, Nairobi, Kenya1
Nancy, Lorraine, France1
Neuburg am Rhein, Rhineland-Palatinate, Germany1
Neuried, Bavaria, Germany1
New Delhi, Delhi, India1
New South Wales, Australia1
Ngorongoro, Arusha, Tanzania1
Niederbüren, St. Gallen, Switzerland1
Nienhagen, Lower Saxony, Germany1
Nkandla, KwaZulu-Natal, South Africa1
Noordhoek, Fishhoek our closest town1
Noordwijk, South Holland, Netherlands1
North West, South Africa1
Nottingham Road, KwaZulu-Natal, South Africa1
Nottingham, England, United Kingdom1
Nyon, Vaud, Switzerland1
NZ1
Oakland, California, United States1
Oberschleißheim, Bavaria, Germany1
Olathe, Kansas, United States1
Olifantsfontein, Gauteng, South Africa1
OM1
Omaha, Nebraska, United States1
Onrus, Western Cape, South Africa1
Oosterbeek, Gelderland, Netherlands1
Orgeval, Île-de-France, France1
Ottawa, Ontario, Canada1
Ottersberg, Lower Saxony, Germany1
Oxon Hill, Maryland, United States1
Palo Alto, California, United States1
Parow, Western Cape, South Africa1
Parys, Free State, South Africa1
Patricia and I live in a large house in central Cape Town which has an apartment attached. We are a few streets from Table Mountain National Park and within walking distance of the city. But there's also a good bus service and nearby shops & restaurants.1
Philippines1
Pietermaritzburg, KwaZulu-Natal, South Africa1
Pontpierre, Luxembourg District, Luxembourg1
Poole, England, United Kingdom1
Port Alfred, Eastern Cape, South Africa1
Port Louis, Port Louis District, Mauritius1
Port Saint Mary, Isle of Man1
Portland, Maine, United States1
Posadas, Misiones Province, Argentina1
Potsdam, Brandenburg, Germany1
Prague, Hlavní město Praha, Czech Republic1
Principauté de Monaco1
Prospect, South Australia, Australia1
QA1
Queensland, Australia1
Richmond, England, United Kingdom1
Riebeek West, South Africa1
Rixensart, Walloon Region, Belgium1
Rome, Lazio, Italy1
Rotterdam, South Holland, Netherlands1
Rotterdam, Zuid-Holland, The Netherlands1
RU1
Rüsselsheim, Hesse, Germany1
SA1
Saarbrücken, Saarland, Germany1
Sachsen bei Ansbach, Bavaria, Germany1
Saint Petersburg, Florida, United States1
Saint-Lubin-des-Joncherets, Centre, France1
Salt Spring Island, British Columbia, Canada1
San Diego, California, United States1
San Francisco (formerly Cape Town)1
San Gimignano, Tuscany, Italy1
Santa Cruz, California, United States1
Santa Monica, California, United States1
São Paulo, State of São Paulo, Brazil1
Sautel, Occitanie, France1
Schönau im Mühlkreis, Upper Austria, Austria1
Schülp bei Rendsburg, Schleswig-Holstein, Germany1
Seaforth, New South Wales, Australia1
Seefeld, Bavaria, Germany1
Sewickley, Pennsylvania, United States1
Simon's Town is located on the shores of False Bay on the Eastern side of the Cape Peninsula, a 45 minute drive from Cape Town1
Singapore, Republic of Singapore1
Singapore, Singapore1
Sliema, Malta1
Soustons, Aquitaine-Limousin-Poitou-Charentes, France1
South Africa1
south africa1
Southlake, Texas, United States1
Springs, Gauteng, South Africa1
Standerton, Mpumalanga, South Africa1
Standish, England, United Kingdom1
Starnberg, Bavaria, Germany1
Stavanger, Rogaland, Norway1
Stuart, Florida, United States1
Stuttgart, Baden-Württemberg, Germany1
Styria, Austria1
Suamico, Wisconsin, United States1
Sugar Land, Texas, United States1
Swakopmund, Erongo Region, Namibia1
Sweden1
Tain-l'Hermitage, Rhone-Alpes, France1
The Hague, South Holland, Netherlands1
The Hague, Zuid-Holland, The Netherlands1
The most beautiful city in the world.1
The Sands, England, United Kingdom1
This is our home1
Tulbagh, Western Cape, South Africa1
Umeå, Västerbotten County, Sweden1
United Arab Emirates1
Upington, Northern Cape, South Africa1
USA1
Utrecht, Utrecht, Netherlands1
Vaalwater, Limpopo, South Africa1
Vestby, Akershus, Norway1
Vilnius, Vilnius County, Lithuania1
Vredendal, Western Cape, South Africa1
Wales, United Kingdom1
Walvis Bay, Erongo Region, Namibia1
Wartburg, KwaZulu-Natal, South Africa1
Weingarten, Baden-Württemberg, Germany1
Wellington, Wellington, New Zealand1
West Beach, South Australia, Australia1
West Chicago, Illinois, United States1
West Horsley, England, United Kingdom1
Western Cape1
Wilderness, Western Cape, South Africa1
Winchester, England, United Kingdom1
Winchester, Massachusetts, United States1
Wokingham, England, United Kingdom1
Wolseley, Western Cape, South Africa1
York, England, United Kingdom1
Zell am See, Salzburg, Austria1
ZM1
Zug, Canton of Zug, Switzerland1
한국1
However, since there are many different values and this variable does not play a significant part in our further analysis, we choose to disregard it.

CATEGORICAL VARIABLES

Lastly, we find several categorical or factor variables (numeric or character variables with variables that have a fixed and known set of possible values).

  • host_response_time (as specified by Airbnb, e.g. “within an hour”, “within a day”)

  • host_response_rate (between 0% and 100%)

  • host_acceptance_rate (between 0% and 100%)

  • host_is_superhost (TRUE or FALSE)

  • host_has_profile_pic (TRUE or FALSE)

  • host_identity_verified (TRUE or FALSE)

  • neighbourhood_cleansed (limited number of Wards)

  • city (limited set of cities that belong to the area)

  • state (only one state, “Western Cape”)

  • zipcode (limited set of zip codes for the area)

  • market (should all be “Cape Town”)

  • country_code (should all be “ZA”)

  • country (should all be “South Africa”)

TEXT VARIABLES

We further find these following variables which are filled with descriptive text or are just URL-addresses and will, thus, also not be relevant to our analysis.

  • name
  • summary
  • space
  • description
  • neighborhood_overview
  • notes
  • transit
  • access
  • interaction
  • house_rules
  • host_about
  • listing_url
  • thumbnail_url (all NAs)
  • medium_url (all NAs)
  • picture_url
  • xl_picture_url (all NAs)
  • host_url
  • host_thumbnail_url
  • host_picture_url
listings_version2 <- listings %>% 
#deselecting all text variables
    dplyr::select(-name, -summary, -space, -description,
                  -neighborhood_overview, -notes, -transit, -access,
                  -interaction, -house_rules, -host_about, -listing_url,
                  -thumbnail_url, -medium_url, -picture_url, -xl_picture_url, 
                  -host_url, -host_thumbnail_url, -host_picture_url)

MISSING VALUES 2

While some variables have 0 missing values, a lot of variables have some missing values. However, we do find 8 variables that miss all values, namely that only (or for one mainly) include NAs (Not Available). We will eliminate these when cleaning the data set.

  • host_neighbourhood
  • thumbnail_url
  • medium_url
  • xl_picture_url
  • neighbourhood
  • neighbourhood_group_cleansed
  • jurisdiction_names
  • license (mainly NAs except for a few irrelevant values)
listings_version3 <- listings_version2 %>% 
#deselecting variables with only or mainly NAs
    dplyr::select(-host_neighbourhood, -neighbourhood, -neighbourhood_group_cleansed, 
                  -jurisdiction_names, -license)

We can also see that all observations have a value for the experiences_offered variable. As there is only 1 unique value, which is “none”, we will can discard this variable, too as it adds no value to our analysis.

listings_version3 <- listings_version3 %>% 
#deselecting the column `experiences_offered`
  dplyr::select(-experiences_offered)

The target variable is the cost for two people to stay at an Airbnb location for four (4) nights, namely (price+extra_people)*4. So we create a new variable price_4_nights in our dataset ‘listings_version3’.

listings_version3 <- listings_version3 %>% 
#creating a new variable `price_4_nights`
  mutate(price_4_nights = ifelse(guests_included < 2, 4*(price+extra_people) + cleaning_fee, 
                                 price*4+cleaning_fee))

Favstats

Lastly, we will have a look at favstats() of the numeric variables that we specified as relevant earlier:

#favstats related numerical variables to examine rough distribution/outliers
favstats(listings_version3$price)
minQ1medianQ3maxmeansdnmissing
1216249881.7e+033e+052.12e+037.54e+03189540
favstats(listings_version3$price_4_nights)
minQ1medianQ3maxmeansdnmissing
4842.81e+034.21e+037.21e+031.2e+068.8e+033.02e+04189540
favstats(listings_version3$number_of_reviews)
minQ1medianQ3maxmeansdnmissing
0041742516.431.2189540
favstats(listings_version3$review_scores_rating)
minQ1medianQ3maxmeansdnmissing
2094.6959910094.57.5189540
favstats(listings_version3$accommodates)
minQ1medianQ3maxmeansdnmissing
1234403.572.31189540
favstats(listings_version3$longitude)
minQ1medianQ3maxmeansdnmissing
18.318.418.418.518.918.50.126189540
favstats(listings_version3$latitude)
minQ1medianQ3maxmeansdnmissing
-34.3-34-33.9-33.9-33.5-340.0998189540

It seems there exist fairly isolated outliers for these numeric variables.

Meanwhile, we can double check from favstate results that all of the 7 variables have 18954 counts, namely no missing values after we have selected the observations with minimum_nights <= 4.

The maximum of price is 300,009, almost 300 (300,009/988) times higher than the median. The standard deviation for price is 7,537.011, signaling high variability.

For price_4_nights , the max is 1,200,036; the mean (8801) is larger than median (4212), indicating that the distribution could be right-skewed.

For number_of_reviews, mean is similar to Q3 and the maximum is way higher than median. This is possibly because a large number of “0”s exist in the column.

For review_scores_rating, most values are within 94.5 and 99, and the mean is 94, indicating that review_scores_rating are fairly high for Cape Town properties.

For accommodates, most properties can have 2 to 4 people, while the largest accommodates 40.

For longitude(latitude), all values are clustered around “18.4”(“-34.0”). This verifies that we are getting right observations in one certain area, namely Cape Town.

Overall, the results of favstats show there are outliers that could distrupt our analysis. Thus in each visualization, we would like to limit the numerical values to its [quantile 10%, quantile 90%].

Informative visualisations

Graphs of variables of interest

Density Plot

  • price

The distribution is right-skewed and prices (per night) are concentrated around $550.

plot_of_price <- listings_version3 %>% 
  filter(price < quantile(price,0.9), price > quantile(price,0.1)) %>% 
  ggplot(aes(x=price)) + 
  geom_density()+
  labs(title="Distribution of price")+
  scale_x_continuous(breaks = seq(0,4000,500))+
  theme_bw()

plot_of_price

  • price_4_nights

The distribution has one peak and price for 4 nights (for two people actually) are concentrated around $25000.

# remove scientific notation
options(scipen=999)

# plot
plot_of_price_4_nights <- listings_version3 %>% 
   filter(cleaning_fee < quantile(price_4_nights,0.9), cleaning_fee > quantile(price_4_nights,0.1)) %>% 
  ggplot(aes(x=price_4_nights)) + 
  geom_density()+
  labs(title="Distribution of price_4_nights")+
  scale_x_continuous(breaks = seq(0,220000,25000))+
  theme_bw()

plot_of_price_4_nights

  • guest_included

The distribution has multiple peaks, indicating that for the price of most properties, only 1 or 2 guests are included. It is likely that when more than two people want to book a property, they will have to pay the extra money (extra_people).

plot_of_guests_included <- listings_version3 %>% 
   #filter(guests_included < quantile(guests_included,0.9), guests_included > quantile(guests_included,0.1)) %>% 
  ggplot(aes(x=guests_included)) + 
  geom_density()+
  labs(title="Distribution of guest included")+
  scale_x_continuous(breaks = seq(0,20,1))+
  theme_bw()

plot_of_guests_included

  • extra_people

The distribution has four peaks, indicating that the charge for one extra person are likely to be $100, $150, $200, $250.

plot_of_extra_people <- listings_version3 %>% 
  filter(extra_people < quantile(extra_people,0.9), extra_people > quantile(extra_people,0.1)) %>%
  ggplot(aes(x=extra_people)) + 
  geom_density()+
  labs(title="Distribution of extra people")+
  theme_bw()

plot_of_extra_people

  • minimum_nights

Most properties require the stay to be around one to two nights, some require three nights. This corresponds to what we have analysed in the “2.3.1 Skim-Minimum nights” section. Hence we get rid of the observations with over 4 nights.

plot_of_minimum_nights <- listings_version3%>% 
  ggplot(aes(x=minimum_nights)) + 
  geom_density()+
  labs(title="Distribution of minimum nights")+
  scale_x_continuous(breaks = seq(0,5,1))+
  theme_bw()

plot_of_minimum_nights

  • number_of_reviews

The distribution is bi-modal, indicating that the numbers of reviews are concentrated below 5.

plot_of_number_of_reviews <- listings_version3%>% 
  filter(number_of_reviews < quantile(number_of_reviews,0.9), number_of_reviews > quantile(number_of_reviews,0.1)) %>%
  ggplot(aes(x=number_of_reviews)) + 
  geom_density()+
  labs(title="Distribution of number of reviews")+
  scale_x_continuous(breaks = seq(0,55,10))+
  theme_bw()

plot_of_number_of_reviews 

  • review_scores_rating

Most review scores rating is above 90, and are most concentrated around 97.5, indicating that most properties receive decent ratings.

plot_of_review_scores_rating <- listings_version3%>% 
  filter(review_scores_rating < quantile(review_scores_rating,0.9), review_scores_rating > quantile(review_scores_rating,0.1)) %>%
  ggplot(aes(x=review_scores_rating)) + 
  geom_density()+
  labs(title="Distribution of review scores rating")+
  theme_bw()

plot_of_review_scores_rating

  • host_listings_count

Most host have around 1 to 10 listings, and the peak is at 1.

plot_of_host_listings_count <- listings_version3%>% 
  ggplot(aes(x=host_listings_count)) + 
  geom_density()+
  labs(title="Distribution of host listings count")+
  xlim(c(0,50)) +
  theme_bw()

plot_of_host_listings_count

  • security_deposit

Most properties do not require security deposit: NA + 0 consist of 13182/18964 (69.5%) of the whole dataset.

plot_of_security_deposit <- listings_version3%>% 
  ggplot(aes(x=security_deposit)) + 
  geom_density()+
  labs(title="Distribution of security deposit")+
  xlim(c(0,10)) +
  theme_bw()

plot_of_security_deposit

histogram

For histograms, we choose to plot accommodates (the maximum number of people) as it reflects beds, bathrooms, and bedrooms combined.

  • accommodates

Overall, most accommodates are concentrated around 2 and 4. But the scenario varies from type to type. Because we filtered out outliers before when creating the listings_version3_filtered, we are losing information on villa properties. So we plot an additional histogram for villa accommodates.

plot_of_accommodates <- listings_version3 %>% 
  ggplot(aes(x=accommodates)) + 
  geom_histogram(binwidth = 1)+
  labs(title="Distribution of accommodates")+
  xlim(0,15)+
  #scale_x_continuous(breaks = seq(1,14,1))+
  theme_bw()

plot_of_accommodates

plot_of_accommodates_by_prop_type <- plot_of_accommodates+
  facet_wrap(~prop_type_simplified,scale="free")+
  labs(title="Distribution of accommodates by property type")

plot_of_accommodates_by_prop_type

Explore villa accommodates

As the histograms shows, villas accommodate more people than the rest as a large proportion of values are concentrated around 8.

listings  %>% 
  filter(prop_type_simplified=="Villa") %>% 
  count()
n
710
plot_of_accommodates_villa <- listings_version3 %>% 
  filter(prop_type_simplified=="Villa") %>% 
  ggplot(aes(x=accommodates)) + 
  geom_histogram(binwidth = 1)+
  labs(title="Distribution of villa accommodates")+
  #xlim(0,15)+
  scale_x_continuous(breaks = seq(1,20,1))+
  theme_bw()

plot_of_accommodates_villa

Bar Plot

For the bar plots we are using the original dataset listings_version3 because we do not want to lose too much data.

  • room_type

Most of the rooms are entire home/apartment, followed by private room.

data_of_room_type <- listings_version3%>% 
 group_by(room_type) %>% 
 summarise(count = n()) %>%
 mutate(room_type = fct_reorder(room_type, desc(count)))

plot_of_room_type <-  data_of_room_type %>% 
  ggplot(aes(x=room_type,y=count)) +
  geom_bar(stat = "identity")+
  geom_text(aes(label=count), vjust=1.6, color="white", size=3.5)+
  theme_bw()

plot_of_room_type

  • neighbourhood_simplified
data_of_neighbourhood_simplified <- listings_version3 %>% 
 group_by(neighbourhood_simplified) %>% 
 summarise(count = n()) %>%
 mutate(neighbourhood_simplified = fct_reorder(neighbourhood_simplified, desc(count)))

plot_of_neighbourhood_simplified <- data_of_neighbourhood_simplified %>% 
  ggplot(aes(x=neighbourhood_simplified,y=count)) +
  labs(title="Most properties are located in Atlantic Seaboard")+
  geom_bar(stat = "identity")+
  theme_bw()


plot_of_neighbourhood_simplified

  • prop_type_simplified
data_of_prop_type_simplified <- listings_version3 %>% 
 group_by(prop_type_simplified) %>% 
 summarise(count = n()) %>%
 mutate(prop_type_simplified= fct_reorder(prop_type_simplified,desc(count)))

plot_of_prop_type_simplified <- data_of_prop_type_simplified %>% 
  ggplot(aes(x=prop_type_simplified,y=count)) +
  labs(title="Most properties are apartment, followed by house")+
  geom_bar(stat = "identity")+
  theme_bw()

plot_of_prop_type_simplified

  • host_is_superhost
plot_of_host_is_superhost <- listings_version3%>% 
  filter(!is.na(host_is_superhost)) %>% 
  ggplot(aes(x=host_is_superhost)) +
  labs(title="Most hosts are not superhost")+
  geom_bar()+
  theme_bw()

plot_of_host_is_superhost

  • host_identity_verified
plot_of_host_identity_verified <- listings_version3%>% 
  filter(!is.na(host_identity_verified)) %>% 
  ggplot(aes(x=host_identity_verified)) +
  labs(title="Most hosts are not verified")+
  geom_bar()+
  theme_bw()

plot_of_host_identity_verified

  • cancellation_policy

Most properties have a cancellation policy of strict 14 days with grace period.

# check out the type of cancellation policy
typeof(listings_version3$cancellation_policy)
## [1] "character"
# make it a factor variable and prepare the data for plotting
data_of_cancellation_policy <- listings_version3 %>% 
 dplyr::select(cancellation_policy)  %>% 
 mutate(cancellation_policy_f = as.factor(cancellation_policy)) %>% 
 group_by(cancellation_policy_f) %>% 
 summarise(count = n()) %>%
 mutate(cancellation_policy_f= fct_reorder(cancellation_policy_f,desc(count)))

# plot 
plot_of_cancellation_policy <- data_of_cancellation_policy %>% 
  ggplot(aes(x=cancellation_policy_f,y=count)) +
  geom_bar(stat = "identity")+
  theme_bw()

plot_of_cancellation_policy 

Correlation matrix

To measure correlation between variables, we use correlation matrix to generate realationship plots and correlation coefficents, which measure both the strength and direction of the linear relationship between two continuous variables.

Before plotting

After inspecting the cleaned dataset, we want to focus on the main range of each variable to detect obvious relationships for further informative visualisations. So, we decide to limit values to 10%-90% percentile.

#quantile(listings_version3$price, probs = c(0.1,0.9))
#quantile(listings_version3$cleaning_fee, probs = c(0.1,0.9))
#quantile(listings_version3$extra_people, probs = c(0.1,0.9))
#quantile(listings_version3$number_of_reviews, probs = c(0.1,0.9))
#quantile(listings_version3$review_scores_rating, probs = c(0.1,0.9))
listings_version3_filtered <- listings_version3 %>% 
                    filter(price_4_nights < quantile(price_4_nights,0.9), price_4_nights > quantile(price_4_nights,0.1),
                           number_of_reviews < quantile(number_of_reviews,0.9), number_of_reviews > quantile(number_of_reviews,0.1),
                          review_scores_rating < quantile(review_scores_rating,0.9), review_scores_rating > quantile(review_scores_rating,0.1),
                          accommodates < quantile(accommodates,0.9), accommodates > quantile(accommodates,0.1))

There is one thing to notice: this listings_version3_filtered is an intersection of all the related variables within [quantile 10%, quantile 90%]. Consequently this is specifically for the production of informative plots within several variables at one time, such as “correlation matrix”.

Correlation matrix in numeric variables

# produce scatterplot-correlation matrix using GGally::ggpairs()
# remove scientific notation
options(scipen=999)
# listings
correlation_matrix1 <- listings_version3_filtered %>% 
  dplyr::select(price_4_nights, number_of_reviews, review_scores_rating, accommodates) %>% 
  GGally::ggpairs(alpha = 0.4)
   
#load library here
library(here)
ggsave("correlation_matrix1.jpg",plot = correlation_matrix1 ,path=here::here("images"), width = 17, height = 10)  
  knitr::include_graphics(here::here("images", "correlation_matrix1.jpg"))

Interpreting correlations between variables

From the correlation matrix above, we can see thataccommodates and price_4_nights are positively correlated with an coefficient of 0.234.

Significant correlation also exists between number_of_reviews & review_scores_rating(0.354). number_of_reviews is significantly correlated with accomodates, sharing a correlation coefficient of -0.075** .

Correlation matrix in neighbourhood_simplified and numerical variables

# produce scatterplot-correlation matrix using GGally::ggpairs()
# remove scientific notation
options(scipen=999)
correlation_matrix2 <-listings_version3_filtered %>% 
  dplyr::select(price_4_nights, number_of_reviews, review_scores_rating, accommodates,neighbourhood_simplified) %>% 
  GGally::ggpairs(aes(color = neighbourhood_simplified), alpha = 0.4)


#load library here
library(here)
ggsave("correlation_matrix2.jpg", plot = correlation_matrix2, path=here::here("images"), width = 17, height = 10)  
  knitr::include_graphics(here::here("images", "correlation_matrix2.jpg"))

Interpreting correlations between variables

Categorizing by neighbourhood_simplified, we can specify clear correlations between numerical variables in details. price_4_nights is significantly correlated with number_of_reviews in “Atlantic Seaboard”(-0.071) and “CBD”(-0.276),

price is significantly correlated with review_scores_rating in “Southern Suburbs”(0.569).

The above two types of trends show that different neighbourhoods do have different levels of impact on the relationship between price_4_nights and number_of_reviews / review_scores_rating.

Take “Southern Suburbs” as an example. Cape Town’s “Southern Suburbs” is identified as being the more affluent of the Cape Town Metropolis’ sections and includes the city’s most expensive residential neighbourhoods. Hence here may exist more evident linear relationship between price_4_nights and review_scores_rating since people think of the experience of the trip and accommodation fully deserves the price. We will further look at whether this correlation appears to be conditional on the value of neighbourhood_simplified below.

From the three boxplots above, we can conclude the distribution of price_4_nights, number_of_reviews and review_scores_rating among different neighbourhood_simplified. Listings in “Atlantic seaboard” always rank ahead in the price_4_nights and number_of_reviews.

While price_4_nights and number_of_reviews of listings in “Atlantic Seaboard” are above average, review_scores_rating in this area is inconsistently low. From our knowledge, “Atlantic Seaboard” is often used by locals to describe the residential/ commercial areas facing the Atlantic seaboard and in front of table mountain. And the low rating may be because people do not think of luxury apartments as a good bargain.

Correlation matrix in prop_type_simplified and numerical variables

# produce scatterplot-correlation matrix using GGally::ggpairs()
# remove scientific notation
options(scipen=999)
correlation_matrix3 <-listings_version3_filtered %>% 
  dplyr::select(price_4_nights, number_of_reviews, review_scores_rating, accommodates,prop_type_simplified) %>% 
  GGally::ggpairs(aes(color =  prop_type_simplified), alpha = 0.4)

# load library here
library(here)
ggsave("correlation_matrix3.jpg", plot = correlation_matrix3, path=here::here("images"), width = 17, height = 10)  
  knitr::include_graphics(here::here("images", "correlation_matrix3.jpg"))

Interpreting correlations between variables

Categorizing by prop_type_simplified, we can specify clear correlations between numerical variables in details. Indeed, different types of property do have different levels of impact on the relationship between price and other variable.

price_4_nights is significantly correlated with number_of_reviews in “guest suite” (0.188), while being non-significant in other property types. Overall, price_4_nights and review_scores_rating are significantly correlated: apartment (0.196), house(0.261), other(0.229).

Regarding the relationship between review_scores_rating and number_of_reviews, there exists a significant correlation however less obvious among villas.

Now let’s delve into the property type. From the boxplots above, we can conclude the distribution of price_4_nights, number_of_reviews and review_scores_rating among different prop_type_simplified. Overall, “Guests suide” have the highest number_of_reviews and review_scores_rating although the price_4_nights for it is fairly low. This is probably because “Guests suide” has fewer data points. Meanwhile, people might think of it as the most rewarding experience with good price.

Correlation matrix in room_type and numerical variables

# produce scatterplot-correlation matrix using GGally::ggpairs()
# remove scientific notation
options(scipen=999)
correlation_matrix4 <-listings_version3_filtered %>% 
# filter variables of interest 
  dplyr::select(price_4_nights, number_of_reviews, review_scores_rating, accommodates,room_type) %>% 
  GGally::ggpairs(aes(color = as.factor(room_type)), alpha = 0.4)

# load library here
library(here)
ggsave("correlation_matrix4.jpg", plot = correlation_matrix4, path=here::here("images"), width = 17, height = 10)  
knitr::include_graphics(here::here("images", "correlation_matrix4.jpg"))

Interpreting correlations between variables

Categorizing by prop_type_simplified, we can specify clear correlations between numerical variables in details. Indeed, different types of property do have different levels of impact on the relationship between price and other variable.

price_4_nights ’s negative correlation with number_of_reviews is only significant among entire home/apartment (-0.004), but the positive correlation with accommodates is significant among all property types except for hotel rooms. It could be that hotel rooms usually charge under their operation rules so the impact of accommodates are less significant.

Regarding the boxplots, it is evident that the price_4_nights varies from type to type, with hotel room being the most expensive. Please note that we filtered data here so the findings are just for reference.

Correlations conditional on categorical variable

Plot1

price_4_nights X number_of_reviews by neighborhood_simplified

price_4_nights_number_of_reviews_1 <- listings_version3 %>% 
  filter(price_4_nights < quantile(price_4_nights,0.9), price_4_nights > quantile(price_4_nights,0.1)) %>% 
  filter(number_of_reviews < quantile(number_of_reviews,0.9), number_of_reviews > quantile(number_of_reviews,0.1)) %>%
  ggplot2::ggplot(aes(x = number_of_reviews , y = price_4_nights))+
  geom_point()+
  geom_smooth()+
  facet_wrap(~neighbourhood_simplified)+
  labs(title = "'CBD' and 'Atlantic Seaboard' sees strong relationship between price_4_nights and number_of_reviews", 
       subtitle = "correlation comparison conditional on neighbourhoods",
       x = "number_of_reviews")+
  theme_bw()+
  NULL


ggsave("price_4_nights_number_of_reviews_1.jpg", plot = price_4_nights_number_of_reviews_1, path=here::here("images"), width = 15, height = 10)  
knitr::include_graphics(here::here("images", "price_4_nights_number_of_reviews_1.jpg"))

The graph above appears to be indicating correlations conditional on neighbourhoods.“Cape Flats”, “Sommerset”, and “Southern Suburbs” have fewer observation points than “Atlantic Seaboard” and “CBD”, which might be a guess answer to the weak correlations.

It can be concluded that properties in different regions attract visitors to leave reviews at different levels, though probably not that obvious considering they still belong to the same metropolitan. For instance, people who go on business trip tend not to leave a review, while leisure travellers tend to do that.

Plot2

price_4_nights X accommodates by neighborhood_simplified

price_4_nights_accommodates_byneighborhood <- listings_version3 %>% 
  filter(price_4_nights < quantile(price_4_nights,0.9), price_4_nights > quantile(price_4_nights,0.1)) %>% 
  filter(accommodates < quantile(accommodates,0.9), accommodates > quantile(accommodates,0.1)) %>% 
  ggplot2::ggplot(aes(x = as.factor(accommodates), y = price_4_nights))+
  geom_boxplot()+
  facet_wrap(~neighbourhood_simplified)+
  #scale_x_continuous(limits = c(75,300))+
  #scale_y_continuous(limits = c(0,3000))+
  labs(title = " 'Atlantic Seaboard' sees highest average price_4_nights across accommodates", 
       subtitle = "Correlation comparison conditional on neighbourhoods",
       x = "accommodates")+
  theme_bw()+
  NULL

ggsave("price_4_nights_accommodates_byneighborhood.jpg", plot = price_4_nights_accommodates_byneighborhood, path=here::here("images"), width = 15, height = 10)  
knitr::include_graphics(here::here("images", "price_4_nights_accommodates_byneighborhood.jpg"))

Plot3

price_4_nights X accommodates by prop_type_simplified

price_4_nights_accommodates_byprop<- listings_version3 %>% 
  filter(price_4_nights < quantile(price_4_nights,0.9), price_4_nights > quantile(price_4_nights,0.1)) %>% 
  filter(accommodates < quantile(accommodates,0.9), accommodates > quantile(accommodates,0.1)) %>% 
  ggplot2::ggplot(aes(x = as.factor(accommodates), y = price_4_nights))+
  geom_boxplot()+
  facet_wrap(~prop_type_simplified)+
  #scale_x_continuous(limits = c(75,300))+
  #scale_y_continuous(limits = c(0,3000))+
  labs(title = " 'House' sees most significant positive relationship ", 
       subtitle = "Correlation comparison conditional on property type",
       x = "accommodates")+
  theme_bw()+
  NULL

ggsave("price_4_nights_accommodates_byprop.jpg", plot = price_4_nights_accommodates_byprop, path=here::here("images"), width = 15, height = 10)  
knitr::include_graphics(here::here("images", "price_4_nights_accommodates_byprop.jpg"))

Mapping

Prepare the dataset for mapping

library(kableExtra)

# check out the unique values for each variable
# for property_type
unique(listings$property_type)
##  [1] "Guesthouse"             "House"                  "Apartment"             
##  [4] "Other"                  "Guest suite"            "Villa"                 
##  [7] "Serviced apartment"     "Bed and breakfast"      "Bungalow"              
## [10] "Loft"                   "Cottage"                "Condominium"           
## [13] "Hotel"                  "Townhouse"              "Hostel"                
## [16] "Boutique hotel"         "Cabin"                  "Chalet"                
## [19] "Aparthotel"             "Tiny house"             "Farm stay"             
## [22] "Lighthouse"             "Treehouse"              "Tent"                  
## [25] "Camper/RV"              "Resort"                 "Nature lodge"          
## [28] "Cave"                   "Earth house"            "Hut"                   
## [31] "Boat"                   "Bus"                    "Vacation home"         
## [34] "Castle"                 "Casa particular (Cuba)" "Dorm"                  
## [37] "Heritage hotel (India)" "Tipi"                   "Barn"                  
## [40] "Dome house"
table(listings$property_type) %>% 
                    kbl()
Var1 Freq
Aparthotel 29
Apartment 8046
Barn 2
Bed and breakfast 606
Boat 6
Boutique hotel 162
Bungalow 107
Bus 1
Cabin 27
Camper/RV 9
Casa particular (Cuba) 2
Castle 2
Cave 1
Chalet 45
Condominium 470
Cottage 385
Dome house 1
Dorm 5
Earth house 11
Farm stay 50
Guest suite 1402
Guesthouse 958
Heritage hotel (India) 1
Hostel 91
Hotel 27
House 4513
Hut 2
Lighthouse 2
Loft 281
Nature lodge 14
Other 86
Resort 5
Serviced apartment 441
Tent 1
Tiny house 32
Tipi 4
Townhouse 414
Treehouse 2
Vacation home 1
Villa 710
# use listings because we need the url
listings_formapping   <- listings %>% 
                                  dplyr::select(longitude, latitude, listing_url, property_type,price,minimum_nights,prop_type_simplified)


# before mapping color to price, check price distribution
# check summary statistics
favstats(listings_formapping$price)
minQ1medianQ3maxmeansdnmissing
1216249881.7e+033e+052.12e+037.54e+03189540
# Quantile 10% to quantile 90% 
quantile(listings_formapping$price,probs = c(0.1,0.9))
##  10%  90% 
##  416 3453
# filtering some data
listings_formapping%>% 
  ggplot(aes(x=price)) + 
    geom_histogram(fill='#69b3a2', color='white') +
    xlab("Price per night") + 
    xlim(c(400,3500))+ 
    theme_bw()

# create the dataset of properties we need to map
listings_formapping_final <- listings_formapping %>% 
                                filter(price<=3453.4, price >= 416,minimum_nights <= 4)

Data cleanings

We found that for price per night, the distribution is right-skwed, with the 10% quantile of 416.0 and the 90% quantile of 3453.4. And since we will color the properties according to its price, we filter the outliers smaller than 416 and larger than 3453.4 and make a new dataset just for the leaflet mapping.

Actual mapping

The basic map

The variables we use are price, longitude, latitude, listing_url, and property_type. The label shows the prop_type_simplified ; the pop up shows its price per night and the corresponding url.

# load the leaflet package
library(leaflet)
# just use price per se
# add layers indicating the property type
library(RColorBrewer)
# assign color palette
# we can do colorNumeric or colorQuantile but to display price, we choose the former
pal <- colorNumeric(
  palette = "PuBuGn",
  domain = c(0,3500),
  n = 5
  )

# create a  basic map
map1 <- leaflet(data =listings_formapping_final
        ) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  # set the initial view
  setView( lat=-34, lng=18.5 , zoom=8) %>%
  # add the circles
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   color = ~pal(price),
                   fillOpacity = 0.4, 
                   # popup shows price per night and url
                   popup = ~paste0("<b>","$",price," per night","</b>"," <br/>",listing_url),
                   label = ~prop_type_simplified) %>% 
  # add legends
    addLegend(pal=pal, values=~price, opacity=0.9, 
              title = "Price per night", position = "bottomleft")

Airbnb listings in Cape Town

The map with layers of property types

For a clearer view, we use Leaflet’s layers control feature for this new mapping. We can toggle the visibility of different property types to show by clicking on the layer control. And the properties are colored based on its price.

# check the property type simplified (we have 5 groups)
listings_formapping_final %>% 
  count(prop_type_simplified) %>% 
  arrange(desc(n))  
prop_type_simplifiedn
Apartment7105
Other3648
House3083
Guest suite1160
Villa227
# create a more complicated map with layers indicating property types
map2_base <- leaflet(data =listings_formapping_final
        ) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  # set the initial view
  setView( lat=-34, lng=18.5 , zoom=10) 
  

# create individual datasets for each property type
m2_apartment <- listings_formapping_final %>% filter(prop_type_simplified == "Apartment")
m2_other <- listings_formapping_final %>% filter(prop_type_simplified == "Other")
m2_house <- listings_formapping_final %>% filter(prop_type_simplified == "House")
m2_guest <- listings_formapping_final %>% filter(prop_type_simplified == "Guest suite")
m2_villa <- listings_formapping_final %>% filter(prop_type_simplified == "Villa")


  # add the circles for Apartment
map2_with_layers <- map2_base %>% addCircleMarkers(data = m2_apartment,
                   lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   color = ~pal(price),
                   fillOpacity = 0.4, 
                   # set the pop up to be listing url
                   # popup shows price per night and url
                   popup = ~paste0("<b>","$",price," per night","</b>"," <br/>",listing_url),
                   # set the label to be property_type
                   label = ~property_type,
                   group = "Apartment") %>% 
  
  # add the circles for House
    addCircleMarkers(data = m2_house,
                   lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   color = ~pal(price),
                   fillOpacity = 0.4, 
                  # popup shows price per night and url
                   popup = ~paste0("<b>","$",price," per night","</b>"," <br/>",listing_url),
                   label = ~property_type,
                   group = "House") %>% 
  
  # add the circles for Guest suite
    addCircleMarkers(data = m2_guest,
                   lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   color = ~pal(price),
                   fillOpacity = 0.4, 
                   # popup shows price per night and url
                   popup = ~paste0("<b>","$",price," per night","</b>"," <br/>",listing_url),
                   label = ~property_type,
                   group = "Guest suite") %>% 
  
  # add the circles for Villa
    addCircleMarkers(data = m2_villa,
                   lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   color = ~pal(price),
                   fillOpacity = 0.4, 
                   # popup shows price per night and url
                   popup = ~paste0("<b>","$",price," per night","</b>"," <br/>",listing_url),
                   label = ~property_type,
                   group = "Villa") %>% 
  
    # add the circles for Other
    addCircleMarkers( data = m2_other,
                lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   color = ~pal(price),
                   fillOpacity = 0.4, 
                  # popup shows price per night and url
                   popup = ~paste0("<b>","$",price," per night","</b>"," <br/>",listing_url),
                # here we used the original property type for group other
                   label = ~property_type,
                   group = "Other") %>% 
  
  # add layer control
  addLayersControl(overlayGroups = c("Apartment","House","Guest suite","Villa","Other")) %>% 
  # add legends
    addLegend(pal=pal, values=~price, opacity=0.9, 
              title = "Price per night", position = "bottomleft")

Airbnb listings in Cape Town by property type

Initial Findings

Based on the map above, we observe location and price difference between various types of properties.

  • For apartments, those near the coastline, for example Three Anchor Bay (belongs to Atlantic seaboard), have higher price per night.
  • For houses, those located in Newlands, Bishopscoutr (belong to Southern Suburbs) are more expensive.
  • For guest suites, those located in the Glen, Bakoven, and Rontree (also belong to Atlantic seaboard) cost more.
  • For villas, the properties seem rather scattered around and are mostly priced above $1,500 per night.
  • For properties that belong to other types, the distribution looks similar to that of apartments.

To conclude:

  • In terms of location, properties near the coastlines (Atlantic seaboard) are more likely to be expensive because they have better views.
  • In terms of the property type and price, houses and villas generally have higher price per night, while guest suites have lower price per night.

Regression & Model Diagnostics

Checking the underlying distribution and applying necessary transformations

We begin our regression by plotting the underlying distribution of the ‘price_4_nights’ variable to check for normality of the dependent variable price_4_nights.

ggplot(data=listings_version3, aes(x=price_4_nights))+
  geom_density()+
  labs(title = "Density of  Price for 2 people for 4 nights at an AirBNB")+
  theme_bw()

ggplot(data=listings_version3, aes(x=log(price_4_nights)))+
  geom_density()+
  labs(title = "Density of  log transformed Price for 2 people for 4 nights at an AirBNB")+
  theme_bw()

ggplot(data=listings_version3, aes(x=sqrt(price_4_nights)))+
  geom_density()+
  labs(title = "Density of square root transformed Price for 2 people for 4 nights at an AirBNB")+
  theme_bw()

ggplot(data=listings_version3, aes(bcPower(price_4_nights,lambda=-0.25)))+
  geom_density()+
  labs(title = "Density of box-cox transformed Price for 2 people for 4 nights at an AirBNB")+
  theme_bw()

When analyzing the underlying distribution of price for 2 people for 4 nights, we notice the underlying data is not normally distributed and running a regression will violate the normality assumption.We then attempted a number of transformations to correct this issue. First we tried a log transformation on the dependent Y (price_4_nights) which seemed to correct the distribution, while also attempting to both a square root and box cox transformation(bcPower) with lamda=-0.25, the square root did not seem to correct the non-normality, while the box-cox transformation seemed to improve the non-normality and skewness the most. For this reason we will consider 3 possible models, namely: the un-transformed ,log-transformed and box-cox transformed and discuss each of their diagnostics throughout.

library(rsample) #here we divide the data in modeling and test data
set.seed(1235)# for reproducibility set seed

model_test_split <- initial_split(listings_version3, prop=0.8) #splitting the data into a modeling set and testing set. Choosing an 80-20 split
listings_model <- training(model_test_split)#model data group
listing_test <- testing(model_test_split)#testing set

Initial model considering Property Types,Number of Reviews and Review scores as possible influential factors.

Un-transformed model

model1 <-lm(price_4_nights~prop_type_simplified+number_of_reviews+review_scores_rating, data=listings_model)

summary(model1)
## 
## Call:
## lm(formula = price_4_nights ~ prop_type_simplified + number_of_reviews + 
##     review_scores_rating, data = listings_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -54873   -4150   -2312     346 1192468 
## 
## Coefficients:
##                                 Estimate Std. Error t value
## (Intercept)                      2658.44    2594.60    1.02
## prop_type_simplifiedGuest suite -2491.34     830.46   -3.00
## prop_type_simplifiedHouse        1658.67     526.61    3.15
## prop_type_simplifiedOther         633.15     531.96    1.19
## prop_type_simplifiedVilla       49620.25    1123.39   44.17
## number_of_reviews                 -33.38       6.63   -5.03
## review_scores_rating               45.21      27.48    1.65
##                                             Pr(>|t|)    
## (Intercept)                                   0.3056    
## prop_type_simplifiedGuest suite               0.0027 ** 
## prop_type_simplifiedHouse                     0.0016 ** 
## prop_type_simplifiedOther                     0.2340    
## prop_type_simplifiedVilla       < 0.0000000000000002 ***
## number_of_reviews                         0.00000049 ***
## review_scores_rating                          0.0999 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25200 on 15157 degrees of freedom
## Multiple R-squared:  0.121,  Adjusted R-squared:  0.121 
## F-statistic:  348 on 6 and 15157 DF,  p-value: <0.0000000000000002
plot(model1)

hist(rstandard(model1))

vif(model1)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.02  4            1.00
## number_of_reviews    1.03  1            1.01
## review_scores_rating 1.02  1            1.01
bptest(model1)
## 
##  studentized Breusch-Pagan test
## 
## data:  model1
## BP = 232, df = 6, p-value <0.0000000000000002
outlierTest(model1)
##       rstudent
## 11271     51.3
## 7629      27.0
## 14925     25.7
## 5788      24.6
## 10099     22.8
## 11630     22.4
## 7630      21.9
## 7628      21.9
## 7448      21.9
## 10829     21.9
##                                                                                                                                                     unadjusted p-value
## 11271 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 7629  0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000126
## 14925 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007983499999999998
## 5788  0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002248499999999999268915646284
## 10099 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000008514699999999998887407205483270459632269445262
## 11630 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000017653999999999997139580909755029453942195577748392
## 7630  0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000583649999999999923186102589075707416687641975890849565
## 7628  0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000600439999999999847585238188017319190671962500168875057
## 7448  0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000606859999999999865730038734441742969341264879935444477
## 10829 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000606859999999999865730038734441742969341264879935444477
##                                                                                                                                                       Bonferroni p
## 11271 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 7629  0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000191
## 14925 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000012105999999999995
## 5788  0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003409699999999998875200300255
## 10099 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000012911999999999998245735330732217302080912452105
## 11630 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000026770999999999996184072387504995849260463826623226
## 7630  0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000885039999999999876096780013780954653828676189386586889
## 7628  0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000910509999999999815553244863287992384541169941610996367
## 7448  0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000920249999999999885916981885468414196076770051294187062
## 10829 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000920249999999999885916981885468414196076770051294187062

Initial analysis of model1 of untransformed price_4_nights against some intuitive explanatory variables: Property Type(simplified),Number of Reviews and Review Score Rating yields a rather weak model with serious violations of underlying assumptions.

The Adjusted R squared reveals that this model only explains approximately 12% of the variability in the dependent variable. It assumes Apartment property types as a base and from the summary we can see that the property being a guest suite is a significant variable and reduces the price by 2491.337, while renting a house or villa is also significant factor and increases the price in comparison to renting an apartment by 1658.674 and 49620.250 respectively on average over 4 nights for 2 people. Other property types are not significant. Number of reviews also proved to be significant ,however, the review rating does not seem to be significant at the 5% level. It’s interesting to note that more reviews equates to a lower price (-33.38 per review)whereas the quality of reviews does not seem to have a significant effect.

Plotting the standardized residuals and Normal-QQ plot reveals the non-normality of the data, we then conducted a Breusch-Pagan test (BP test) which confirmed suspicion of Heteroscedasticity, This can be seen in the residuals vs fitted plot. These violations of underlying assumptions of multiple-regression make this model flawed and thus we will no longer consider the untransformed model.It can be noted that no particular observations was beyond 0.5 Cook’s Distance however, when running an outlier test, this revealed 10 possible outliers that we may want to consider removing to correct the underlying issues. However, as there are 15164 observations we deemed that removing these 10 outliers may have little to no effect.

Log transformed model

model1t <-lm(log(price_4_nights)~prop_type_simplified+number_of_reviews+review_scores_rating,data=listings_model) #running the same variables on a log transformed test 

summary(model1t)
## 
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
##     review_scores_rating, data = listings_model)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.651 -0.469 -0.078  0.400  5.563 
## 
## Coefficients:
##                                  Estimate Std. Error t value
## (Intercept)                      7.839536   0.079678   98.39
## prop_type_simplifiedGuest suite -0.425330   0.025503  -16.68
## prop_type_simplifiedHouse        0.097761   0.016172    6.05
## prop_type_simplifiedOther       -0.009298   0.016336   -0.57
## prop_type_simplifiedVilla        1.634476   0.034498   47.38
## number_of_reviews               -0.001187   0.000204   -5.83
## review_scores_rating             0.006391   0.000844    7.57
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite < 0.0000000000000002 ***
## prop_type_simplifiedHouse          0.000000001526682 ***
## prop_type_simplifiedOther                       0.57    
## prop_type_simplifiedVilla       < 0.0000000000000002 ***
## number_of_reviews                  0.000000005775807 ***
## review_scores_rating               0.000000000000038 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.774 on 15157 degrees of freedom
## Multiple R-squared:  0.16,   Adjusted R-squared:  0.16 
## F-statistic:  481 on 6 and 15157 DF,  p-value: <0.0000000000000002
plot(model1t)

hist(rstandard(model1t))

vif(model1t)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.02  4            1.00
## number_of_reviews    1.03  1            1.01
## review_scores_rating 1.02  1            1.01
bptest(model1t)
## 
##  studentized Breusch-Pagan test
## 
## data:  model1t
## BP = 767, df = 6, p-value <0.0000000000000002

Due to the non-normality issues of the underlying residuals we analyze the log transformation to the dependent variable. This seems to correct the non-normality of the residuals as shown by the increase in slope and fit of the Normal-QQ plot, and more normal looking distribution of the standard residuals histogram. We also have a higher Adjusted R squared(approx 16%) and thus conclude that the log transformed model is a better estimate of price for 4 nights. We will thus continue the analysis considering the log transformed model. However we note that the residuals vs fitted plot again reveals heteroscedasticity issues that should be corrected. The VIF test reveals no multi-collinearity issues and we once again run an outlier test to identify possible observations that could be removed in order to improve the model.

Box-Cox Model

model1_box <-lm(bcPower(price_4_nights,lambda=-0.25)~prop_type_simplified+number_of_reviews+review_scores_rating, data=listings_model) #box cox lm model on 3 variables

summary(model1_box)
## 
## Call:
## lm(formula = bcPower(price_4_nights, lambda = -0.25) ~ prop_type_simplified + 
##     number_of_reviews + review_scores_rating, data = listings_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3670 -0.0518 -0.0020  0.0542  0.3729 
## 
## Coefficients:
##                                   Estimate Std. Error t value
## (Intercept)                      3.4358547  0.0090650  379.02
## prop_type_simplifiedGuest suite -0.0525631  0.0029015  -18.12
## prop_type_simplifiedHouse        0.0058936  0.0018399    3.20
## prop_type_simplifiedOther       -0.0019794  0.0018586   -1.07
## prop_type_simplifiedVilla        0.1536879  0.0039249   39.16
## number_of_reviews               -0.0001121  0.0000232   -4.84
## review_scores_rating             0.0007657  0.0000960    7.98
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite < 0.0000000000000002 ***
## prop_type_simplifiedHouse                     0.0014 ** 
## prop_type_simplifiedOther                     0.2869    
## prop_type_simplifiedVilla       < 0.0000000000000002 ***
## number_of_reviews                 0.0000013272095799 ***
## review_scores_rating              0.0000000000000016 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.088 on 15157 degrees of freedom
## Multiple R-squared:  0.125,  Adjusted R-squared:  0.124 
## F-statistic:  360 on 6 and 15157 DF,  p-value: <0.0000000000000002
plot(model1_box)

hist(rstandard(model1_box))

bptest(model1_box)
## 
##  studentized Breusch-Pagan test
## 
## data:  model1_box
## BP = 685, df = 6, p-value <0.0000000000000002
vif(model1_box)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.02  4            1.00
## number_of_reviews    1.03  1            1.01
## review_scores_rating 1.02  1            1.01
outlierTest(model1_box)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##       rstudent unadjusted p-value Bonferroni p
## 11271     4.24          0.0000226        0.343

While the box-cox transformation seemed to correct the underlying non-normality the most succesfully (see QQ-plot and histogram of residuals ),this transformation yielded a rather weak Adjusted R-squared of 12.43% meaning very little variability is being captured in this model. The log transformation seems to yield a stronger result while still maintaining normality. This model also seems to reflect heteroscedasticity issues(shown by residuals vs fitted plot and failing the BP test), however we will continue this analysis using both the box-cox and log transformed model.VIF test reveals no multi-colinearity issues and we once again run an outlier test to identify possible observations that could be removed in order to improve the model.

Considering room type as a factor:

We now proceed to consider an additional dependent variable room_type (Room Type), We hypothesize this factor to have a significant effect and that room types that are generally smaller or shared are cheaper than entire homes.

Log transformation model with room type

model2_log <-lm(log(price_4_nights)~prop_type_simplified+number_of_reviews+review_scores_rating+room_type, data=listings_model)

summary(model2_log)
## 
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
##     review_scores_rating + room_type, data = listings_model)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.781 -0.460 -0.093  0.372  5.968 
## 
## Coefficients:
##                                  Estimate Std. Error t value
## (Intercept)                      8.098284   0.073837  109.68
## prop_type_simplifiedGuest suite -0.281179   0.023745  -11.84
## prop_type_simplifiedHouse        0.310308   0.015535   19.98
## prop_type_simplifiedOther        0.183902   0.016143   11.39
## prop_type_simplifiedVilla        1.662733   0.031882   52.15
## number_of_reviews               -0.002412   0.000190  -12.69
## review_scores_rating             0.004737   0.000781    6.07
## room_typeHotel room             -0.056496   0.042113   -1.34
## room_typePrivate room           -0.700470   0.014317  -48.93
## room_typeShared room            -1.164642   0.069776  -16.69
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite < 0.0000000000000002 ***
## prop_type_simplifiedHouse       < 0.0000000000000002 ***
## prop_type_simplifiedOther       < 0.0000000000000002 ***
## prop_type_simplifiedVilla       < 0.0000000000000002 ***
## number_of_reviews               < 0.0000000000000002 ***
## review_scores_rating                    0.0000000013 ***
## room_typeHotel room                             0.18    
## room_typePrivate room           < 0.0000000000000002 ***
## room_typeShared room            < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.715 on 15154 degrees of freedom
## Multiple R-squared:  0.283,  Adjusted R-squared:  0.282 
## F-statistic:  664 on 9 and 15154 DF,  p-value: <0.0000000000000002
plot(model2_log)

hist(rstandard(model2_log))

bptest(model2_log)
## 
##  studentized Breusch-Pagan test
## 
## data:  model2_log
## BP = 389, df = 9, p-value <0.0000000000000002
vif(model2_log)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.23  4            1.03
## number_of_reviews    1.05  1            1.02
## review_scores_rating 1.02  1            1.01
## room_type            1.24  3            1.04
outlierTest(model2_log)
##       rstudent    unadjusted p-value     Bonferroni p
## 11271     8.37 0.0000000000000000633 0.00000000000096
## 14925     8.18 0.0000000000000003111 0.00000000000472
## 5788      7.43 0.0000000000001144000 0.00000000173480
## 12609     6.69 0.0000000000235000000 0.00000035636000
## 11630     6.32 0.0000000002707100000 0.00000410500000
## 9329      5.50 0.0000000385180000000 0.00058409000000
## 10830     5.47 0.0000000448000000000 0.00067935000000
## 15017     5.33 0.0000001007200000000 0.00152730000000
## 492       5.26 0.0000001465400000000 0.00222210000000
## 4502      5.16 0.0000002488100000000 0.00377300000000

The log transformed model showed significant improvement with the addition of the room type independent variable. Adjusted R-squared significantly increased from 16% to 28%. Using room type Entire Apt and property type apartment as a base all explanatory variables are now significant at the 5 % level except for room_type Hotel. Our intercept marginally increases meaning there was an increase in average price. The addition of room types also alters the slope of the co-efficient for some of the other factors. Property types Guest suite slope reduces in magnitude while all other factors in property type appear to increase in magnitude with property type ‘Other’ becoming significant. Review score rating also becomes significant which it was not before. And while the fitted vs residuals plot also showed improvement we note that it again fails the BP test and we have not solved the heteroscedasticity issues. VIF reveals no multi-colinearity issues and we once again identify possible outliers, which we will continue to do for the remaining models.

Box cox transformation model with room type

model2_box <-lm( bcPower(price_4_nights,lambda=-0.25)~prop_type_simplified+
                   number_of_reviews+
                   review_scores_rating+
                   room_type, data=listings_model) 

summary(model2_box)
## 
## Call:
## lm(formula = bcPower(price_4_nights, lambda = -0.25) ~ prop_type_simplified + 
##     number_of_reviews + review_scores_rating + room_type, data = listings_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3792 -0.0518 -0.0043  0.0497  0.4693 
## 
## Coefficients:
##                                   Estimate Std. Error t value
## (Intercept)                      3.4678496  0.0082809  418.78
## prop_type_simplifiedGuest suite -0.0349367  0.0026630  -13.12
## prop_type_simplifiedHouse        0.0319065  0.0017422   18.31
## prop_type_simplifiedOther        0.0220273  0.0018104   12.17
## prop_type_simplifiedVilla        0.1571410  0.0035756   43.95
## number_of_reviews               -0.0002632  0.0000213  -12.35
## review_scores_rating             0.0005601  0.0000876    6.40
## room_typeHotel room             -0.0099548  0.0047230   -2.11
## room_typePrivate room           -0.0855355  0.0016056  -53.27
## room_typeShared room            -0.1534480  0.0078254  -19.61
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite < 0.0000000000000002 ***
## prop_type_simplifiedHouse       < 0.0000000000000002 ***
## prop_type_simplifiedOther       < 0.0000000000000002 ***
## prop_type_simplifiedVilla       < 0.0000000000000002 ***
## number_of_reviews               < 0.0000000000000002 ***
## review_scores_rating                   0.00000000016 ***
## room_typeHotel room                            0.035 *  
## room_typePrivate room           < 0.0000000000000002 ***
## room_typeShared room            < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0802 on 15154 degrees of freedom
## Multiple R-squared:  0.274,  Adjusted R-squared:  0.273 
## F-statistic:  635 on 9 and 15154 DF,  p-value: <0.0000000000000002
plot(model2_box)

hist(rstandard(model2_box))

bptest(model2_box)
## 
##  studentized Breusch-Pagan test
## 
## data:  model2_box
## BP = 286, df = 9, p-value <0.0000000000000002
vif(model2_box)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.23  4            1.03
## number_of_reviews    1.05  1            1.02
## review_scores_rating 1.02  1            1.01
## room_type            1.24  3            1.04
outlierTest(model2_box)
##       rstudent unadjusted p-value Bonferroni p
## 14925     5.89      0.00000000401    0.0000608
## 11271     5.27      0.00000014067    0.0021332
## 12609     5.20      0.00000019905    0.0030184
## 5788      4.99      0.00000060215    0.0091310
## 9884     -4.73      0.00000221980    0.0336620

The box-cox transformed model also showed significant improvement with the addition of the room type independent variable. Adjusted R-squared significantly increased from ~12.5% to ~27%. Using room type Entire Apt and property type apartment as a base model all explanatory variables are now significant at the 5 % level. Just as in the log model we note that the data still reflects heteroscedasticity issues. Both the log transformed and box-cox, however, seem to have maintained normality in their residuals so we will continue assessing both models by next considering if number of bathrooms, bedrooms, beds, or size of the house (accomodates) significantly predict (price_4_nights)Price for 4 nights. We can conclude our hypothesis was correct and that room type is a significant predictor in Price for 4 nights

Considering beds,Bathrooms,Bedrooms and Size Factor:

Next we consider adding beds,bathrooms,bedrooms and size(accommodates) as we believe these quantitative factors should be able to capture more of the variability in the model. ### Log model

model3_log_BBB <-lm(log(price_4_nights)~prop_type_simplified+ #adding bed bathrooms, bedrooms and size(accoommodates)
                      number_of_reviews+
                      review_scores_rating+
                      room_type+bathrooms+
                      #bedrooms+
                      beds+
                      accommodates, 
                    data=listings_model)


summary(model3_log_BBB)
## 
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
##     review_scores_rating + room_type + bathrooms + beds + accommodates, 
##     data = listings_model)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.344 -0.372 -0.069  0.304  5.796 
## 
## Coefficients:
##                                  Estimate Std. Error t value
## (Intercept)                      7.466454   0.065764  113.53
## prop_type_simplifiedGuest suite -0.231529   0.020818  -11.12
## prop_type_simplifiedHouse       -0.013864   0.014466   -0.96
## prop_type_simplifiedOther        0.061177   0.014245    4.29
## prop_type_simplifiedVilla        0.726406   0.031016   23.42
## number_of_reviews               -0.001795   0.000166  -10.80
## review_scores_rating             0.004683   0.000688    6.81
## room_typeHotel room              0.113999   0.036863    3.09
## room_typePrivate room           -0.369377   0.013638  -27.08
## room_typeShared room            -0.803357   0.061913  -12.98
## bathrooms                        0.152655   0.006453   23.66
## beds                            -0.070300   0.004423  -15.90
## accommodates                     0.163130   0.003844   42.43
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite < 0.0000000000000002 ***
## prop_type_simplifiedHouse                      0.338    
## prop_type_simplifiedOther              0.00001761349 ***
## prop_type_simplifiedVilla       < 0.0000000000000002 ***
## number_of_reviews               < 0.0000000000000002 ***
## review_scores_rating                   0.00000000001 ***
## room_typeHotel room                            0.002 ** 
## room_typePrivate room           < 0.0000000000000002 ***
## room_typeShared room            < 0.0000000000000002 ***
## bathrooms                       < 0.0000000000000002 ***
## beds                            < 0.0000000000000002 ***
## accommodates                    < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.624 on 15069 degrees of freedom
##   (82 observations deleted due to missingness)
## Multiple R-squared:  0.453,  Adjusted R-squared:  0.453 
## F-statistic: 1.04e+03 on 12 and 15069 DF,  p-value: <0.0000000000000002
plot(model3_log_BBB)

hist(rstandard(model3_log_BBB))

vif(model3_log_BBB)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.67  4            1.07
## number_of_reviews    1.05  1            1.02
## review_scores_rating 1.02  1            1.01
## room_type            1.48  3            1.07
## bathrooms            2.27  1            1.51
## beds                 2.71  1            1.65
## accommodates         3.04  1            1.74
bptest(model3_log_BBB)
## 
##  studentized Breusch-Pagan test
## 
## data:  model3_log_BBB
## BP = 1203, df = 12, p-value <0.0000000000000002
outlierTest(model3_log_BBB)
##       rstudent        unadjusted p-value         Bonferroni p
## 12543    -9.58 0.00000000000000000000113 0.000000000000000017
## 11271     9.32 0.00000000000000000001327 0.000000000000000200
## 5788      9.06 0.00000000000000000014416 0.000000000000002174
## 14925     8.45 0.00000000000000003280700 0.000000000000494800
## 10091    -7.60 0.00000000000003041500000 0.000000000458720000
## 14352    -7.46 0.00000000000009162000000 0.000000001381800000
## 12609     7.42 0.00000000000012263000000 0.000000001849400000
## 10193     6.10 0.00000000107040000000000 0.000016143000000000
## 10830     5.98 0.00000000224480000000000 0.000033857000000000
## 9329      5.95 0.00000000282150000000000 0.000042553000000000

Adding Bed, Bathrooms and bedrooms and size factors to the log model seemed to dramatically improve the adjusted R-squared from ~28% to ~45%. All factors are significant at the 5% level,except for house property types having no significant effect with respect to the base (apartment property types). We continuously check the residuals plots which seem to show improvements in heteroscedasticity issues (as shown by the fitted vs residuals) we also see that listing 12543 falls outside of the 0.5 cooks distance and is an influential point that should be removed for model improvement. We also noted that bedrooms had a VIF>5 and thus showed multi-colinearity issues, we chose to remove it and run the model again which showed no multi-colinearity issues. Outliers were identified and examined.

Box-Cox model with beds,bathrooms,bedrooms and size

model3_box_BBB <-lm(bcPower(price_4_nights,lambda = -0.25)~prop_type_simplified+ #adding bed bathrooms, bedrooms and size(accoommodates)
                      number_of_reviews+
                      review_scores_rating+
                      room_type+bathrooms+
                      #bedrooms+
                      beds+
                      accommodates, 
                    data=listings_model)


summary(model3_box_BBB)
## 
## Call:
## lm(formula = bcPower(price_4_nights, lambda = -0.25) ~ prop_type_simplified + 
##     number_of_reviews + review_scores_rating + room_type + bathrooms + 
##     beds + accommodates, data = listings_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5040 -0.0421 -0.0027  0.0411  0.4341 
## 
## Coefficients:
##                                   Estimate Std. Error t value
## (Intercept)                      3.3989902  0.0074633  455.43
## prop_type_simplifiedGuest suite -0.0297488  0.0023626  -12.59
## prop_type_simplifiedHouse       -0.0033588  0.0016416   -2.05
## prop_type_simplifiedOther        0.0086542  0.0016166    5.35
## prop_type_simplifiedVilla        0.0569981  0.0035199   16.19
## number_of_reviews               -0.0001963  0.0000189  -10.40
## review_scores_rating             0.0005617  0.0000781    7.20
## room_typeHotel room              0.0086523  0.0041835    2.07
## room_typePrivate room           -0.0490164  0.0015478  -31.67
## room_typeShared room            -0.1158545  0.0070262  -16.49
## bathrooms                        0.0141448  0.0007323   19.32
## beds                            -0.0066829  0.0005019  -13.32
## accommodates                     0.0180558  0.0004363   41.39
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite < 0.0000000000000002 ***
## prop_type_simplifiedHouse                      0.041 *  
## prop_type_simplifiedOther           0.00000008765063 ***
## prop_type_simplifiedVilla       < 0.0000000000000002 ***
## number_of_reviews               < 0.0000000000000002 ***
## review_scores_rating                0.00000000000065 ***
## room_typeHotel room                            0.039 *  
## room_typePrivate room           < 0.0000000000000002 ***
## room_typeShared room            < 0.0000000000000002 ***
## bathrooms                       < 0.0000000000000002 ***
## beds                            < 0.0000000000000002 ***
## accommodates                    < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0708 on 15069 degrees of freedom
##   (82 observations deleted due to missingness)
## Multiple R-squared:  0.432,  Adjusted R-squared:  0.432 
## F-statistic:  955 on 12 and 15069 DF,  p-value: <0.0000000000000002
plot(model3_box_BBB)

hist(rstandard(model3_box_BBB))

bptest(model3_box_BBB)
## 
##  studentized Breusch-Pagan test
## 
## data:  model3_box_BBB
## BP = 1063, df = 12, p-value <0.0000000000000002
vif(model3_box_BBB)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.67  4            1.07
## number_of_reviews    1.05  1            1.02
## review_scores_rating 1.02  1            1.01
## room_type            1.48  3            1.07
## bathrooms            2.27  1            1.51
## beds                 2.71  1            1.65
## accommodates         3.04  1            1.74
outlierTest(model3_box_BBB)
##       rstudent  unadjusted p-value    Bonferroni p
## 12543    -7.84 0.00000000000000476 0.0000000000717
## 10091    -7.34 0.00000000000023141 0.0000000034901
## 14352    -6.58 0.00000000004895500 0.0000007383400
## 5788      6.14 0.00000000084685000 0.0000127720000
## 14925     5.81 0.00000000630930000 0.0000951570000
## 11271     5.67 0.00000001484500000 0.0002239000000
## 12609     5.66 0.00000001560200000 0.0002353100000
## 9884     -5.64 0.00000001712500000 0.0002582700000
## 10193     5.30 0.00000012041000000 0.0018160000000
## 7687     -5.11 0.00000031792000000 0.0047949000000

Adding Bed, Bathrooms and bedrooms and size factors to the box cox model seemed to also dramatically improve the adjusted r-squared from ~27% to ~43%. All factors are significant at the 5% level with underlying equation assuming apartment property type as a base. We check the residual plots again which seems to show improvement in heteroscedasticity issues(as shown by the fitted vs residuals and Scale-Location plots) we also see again that listing 12543 falls outside of the 0.5 Cook’s distance and is an influential point that should be removed for model improvement. We also noted that bedrooms had a VIF greater than 5 and thus showed multi-collinearity issues, we therefore chose to remove it and run the model again which, this time, showed no multi-collinearity issues. Outliers were identified and examined.

Considering host effects on Price:

We aim to improve our Adjusted r-squared by considering the effects that hosts have on prices.We examine if being a ‘super host’, having a high number of other listings and having their identification verified affects prices while still including previous significant factors from the previous model.

Log model including host effects factors**

model4_log_hosts <-lm(log(price_4_nights)~prop_type_simplified+
                      number_of_reviews+
                      review_scores_rating+
                      room_type+bathrooms+
                      #bedrooms+
                      beds+
                      accommodates+host_is_superhost+ #added host effects
                        host_total_listings_count+
                        host_identity_verified, 
                    data=listings_model)


summary(model4_log_hosts)
## 
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
##     review_scores_rating + room_type + bathrooms + beds + accommodates + 
##     host_is_superhost + host_total_listings_count + host_identity_verified, 
##     data = listings_model)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.277 -0.376 -0.059  0.313  5.813 
## 
## Coefficients:
##                                  Estimate Std. Error t value
## (Intercept)                      7.414781   0.065158  113.80
## prop_type_simplifiedGuest suite -0.204123   0.020471   -9.97
## prop_type_simplifiedHouse        0.006726   0.014203    0.47
## prop_type_simplifiedOther        0.077188   0.013970    5.53
## prop_type_simplifiedVilla        0.725077   0.030441   23.82
## number_of_reviews               -0.001807   0.000176  -10.24
## review_scores_rating             0.004590   0.000685    6.70
## room_typeHotel room              0.109130   0.036094    3.02
## room_typePrivate room           -0.348569   0.013433  -25.95
## room_typeShared room            -0.775051   0.060643  -12.78
## bathrooms                        0.150597   0.006324   23.82
## beds                            -0.069625   0.004333  -16.07
## accommodates                     0.164751   0.003765   43.76
## host_is_superhostTRUE            0.005984   0.012789    0.47
## host_total_listings_count        0.000844   0.000034   24.80
## host_identity_verifiedTRUE       0.086557   0.011840    7.31
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite < 0.0000000000000002 ***
## prop_type_simplifiedHouse                     0.6358    
## prop_type_simplifiedOther           0.00000003342798 ***
## prop_type_simplifiedVilla       < 0.0000000000000002 ***
## number_of_reviews               < 0.0000000000000002 ***
## review_scores_rating                0.00000000002202 ***
## room_typeHotel room                           0.0025 ** 
## room_typePrivate room           < 0.0000000000000002 ***
## room_typeShared room            < 0.0000000000000002 ***
## bathrooms                       < 0.0000000000000002 ***
## beds                            < 0.0000000000000002 ***
## accommodates                    < 0.0000000000000002 ***
## host_is_superhostTRUE                         0.6399    
## host_total_listings_count       < 0.0000000000000002 ***
## host_identity_verifiedTRUE          0.00000000000028 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.611 on 15065 degrees of freedom
##   (83 observations deleted due to missingness)
## Multiple R-squared:  0.476,  Adjusted R-squared:  0.475 
## F-statistic:  912 on 15 and 15065 DF,  p-value: <0.0000000000000002
plot(model4_log_hosts)

hist(rstandard(model4_log_hosts))

bptest(model4_log_hosts)
## 
##  studentized Breusch-Pagan test
## 
## data:  model4_log_hosts
## BP = 1467, df = 15, p-value <0.0000000000000002
vif(model4_log_hosts)
##                           GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified      1.70  4            1.07
## number_of_reviews         1.23  1            1.11
## review_scores_rating      1.06  1            1.03
## room_type                 1.50  3            1.07
## bathrooms                 2.27  1            1.51
## beds                      2.71  1            1.65
## accommodates              3.04  1            1.74
## host_is_superhost         1.26  1            1.12
## host_total_listings_count 1.02  1            1.01
## host_identity_verified    1.06  1            1.03
outlierTest(model4_log_hosts)
##       rstudent         unadjusted p-value           Bonferroni p
## 12543    -9.66 0.000000000000000000000496 0.00000000000000000748
## 11271     9.55 0.000000000000000000001519 0.00000000000000002291
## 5788      9.29 0.000000000000000000017380 0.00000000000000026211
## 14925     8.64 0.000000000000000006036900 0.00000000000009104200
## 10091    -7.85 0.000000000000004267600000 0.00000000006436000000
## 12609     7.58 0.000000000000035536000000 0.00000000053592000000
## 14352    -7.57 0.000000000000039691000000 0.00000000059858000000
## 10193     6.20 0.000000000570840000000000 0.00000860890000000000
## 9329      6.17 0.000000000714730000000000 0.00001077900000000000
## 10830     6.17 0.000000000716870000000000 0.00001081100000000000

The inclusion of host characteristics again improved our adjusted-R squared (but not that significantly) from 45% to 47%. However both host identity verification and hosts total listings count were found to be significant at a 5% level. Being a super host had no significant effect on pricing and will thus be removed in further models.

Box cox model including host effects factors

model4_box_hosts <-lm(bcPower(price_4_nights,lambda = -0.25)~prop_type_simplified+
                      number_of_reviews+
                      review_scores_rating+
                      room_type+bathrooms+
                      #bedrooms+
                      beds+
                      accommodates+host_is_superhost+
                        host_total_listings_count+
                        host_identity_verified, 
                    data=listings_model)


summary(model4_box_hosts)
## 
## Call:
## lm(formula = bcPower(price_4_nights, lambda = -0.25) ~ prop_type_simplified + 
##     number_of_reviews + review_scores_rating + room_type + bathrooms + 
##     beds + accommodates + host_is_superhost + host_total_listings_count + 
##     host_identity_verified, data = listings_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5080 -0.0424 -0.0022  0.0415  0.4361 
## 
## Coefficients:
##                                    Estimate  Std. Error t value
## (Intercept)                      3.39349548  0.00745510  455.19
## prop_type_simplifiedGuest suite -0.02718016  0.00234219  -11.60
## prop_type_simplifiedHouse       -0.00162996  0.00162508   -1.00
## prop_type_simplifiedOther        0.01003417  0.00159836    6.28
## prop_type_simplifiedVilla        0.05681804  0.00348287   16.31
## number_of_reviews               -0.00019683  0.00002019   -9.75
## review_scores_rating             0.00056181  0.00007841    7.16
## room_typeHotel room              0.00820489  0.00412965    1.99
## room_typePrivate room           -0.04717045  0.00153697  -30.69
## room_typeShared room            -0.11315680  0.00693844  -16.31
## bathrooms                        0.01393897  0.00072350   19.27
## beds                            -0.00660188  0.00049572  -13.32
## accommodates                     0.01818881  0.00043076   42.22
## host_is_superhostTRUE           -0.00057704  0.00146321   -0.39
## host_total_listings_count        0.00007397  0.00000389   19.00
## host_identity_verifiedTRUE       0.00956931  0.00135469    7.06
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite < 0.0000000000000002 ***
## prop_type_simplifiedHouse                      0.316    
## prop_type_simplifiedOther           0.00000000035283 ***
## prop_type_simplifiedVilla       < 0.0000000000000002 ***
## number_of_reviews               < 0.0000000000000002 ***
## review_scores_rating                0.00000000000082 ***
## room_typeHotel room                            0.047 *  
## room_typePrivate room           < 0.0000000000000002 ***
## room_typeShared room            < 0.0000000000000002 ***
## bathrooms                       < 0.0000000000000002 ***
## beds                            < 0.0000000000000002 ***
## accommodates                    < 0.0000000000000002 ***
## host_is_superhostTRUE                          0.693    
## host_total_listings_count       < 0.0000000000000002 ***
## host_identity_verifiedTRUE          0.00000000000169 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0699 on 15065 degrees of freedom
##   (83 observations deleted due to missingness)
## Multiple R-squared:  0.447,  Adjusted R-squared:  0.446 
## F-statistic:  811 on 15 and 15065 DF,  p-value: <0.0000000000000002
plot(model4_box_hosts)

hist(rstandard(model4_box_hosts))

bptest(model4_box_hosts)
## 
##  studentized Breusch-Pagan test
## 
## data:  model4_box_hosts
## BP = 1224, df = 15, p-value <0.0000000000000002
vif(model4_box_hosts)
##                           GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified      1.70  4            1.07
## number_of_reviews         1.23  1            1.11
## review_scores_rating      1.06  1            1.03
## room_type                 1.50  3            1.07
## bathrooms                 2.27  1            1.51
## beds                      2.71  1            1.65
## accommodates              3.04  1            1.74
## host_is_superhost         1.26  1            1.12
## host_total_listings_count 1.02  1            1.01
## host_identity_verified    1.06  1            1.03
outlierTest(model4_box_hosts)
##       rstudent  unadjusted p-value    Bonferroni p
## 12543    -7.84 0.00000000000000467 0.0000000000705
## 10091    -7.49 0.00000000000007202 0.0000000010862
## 14352    -6.62 0.00000000003811600 0.0000005748200
## 5788      6.25 0.00000000042417000 0.0000063969000
## 14925     5.90 0.00000000371120000 0.0000559690000
## 11271     5.76 0.00000000834710000 0.0001258800000
## 12609     5.74 0.00000000992930000 0.0001497400000
## 9884     -5.67 0.00000001437100000 0.0002167300000
## 10193     5.33 0.00000010059000000 0.0015170000000
## 7687     -5.18 0.00000022576000000 0.0034047000000

The inclusion of host characteristics again improved the adjusted R-squared from ~43% to ~45%.Being a super host as in the log transformed model again has no significant effect on price while host total listings and identity verification proved to be significant. At this point we also note that the box-cox transformation yielded weak adjusted r-squared values across the 4 models considered so far. We thus conclude the log transformed model as being a better transformation and continue our analysis only examining the log transformation of the dependent variable.

Examining Location Characteristics

We next choose to examine the effects that different neighborhoods(simplified) and having the location exactly as listed has on price by including the same factors as in the previous model with the exception of being a ‘super host’, as this proved insignificant.

model5_log_loc <-lm(log(price_4_nights)~prop_type_simplified+
                      number_of_reviews+
                      review_scores_rating+
                      room_type+bathrooms+
                      #bedrooms+ bedrooms had vif>5 so removed and run the model again
                      beds+
                      accommodates+
                        host_total_listings_count+
                        host_identity_verified+
                        is_location_exact+neighbourhood_simplified, #adding if property  location is  
                    data=listings_model)


summary(model5_log_loc)
## 
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
##     review_scores_rating + room_type + bathrooms + beds + accommodates + 
##     host_total_listings_count + host_identity_verified + is_location_exact + 
##     neighbourhood_simplified, data = listings_model)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.859 -0.369 -0.053  0.314  5.772 
## 
## Coefficients:
##                                            Estimate Std. Error t value
## (Intercept)                               7.5432855  0.0638998  118.05
## prop_type_simplifiedGuest suite          -0.1523122  0.0202181   -7.53
## prop_type_simplifiedHouse                 0.0297247  0.0140207    2.12
## prop_type_simplifiedOther                 0.0866387  0.0137265    6.31
## prop_type_simplifiedVilla                 0.7033002  0.0298822   23.54
## number_of_reviews                        -0.0023230  0.0001642  -14.15
## review_scores_rating                      0.0041759  0.0006614    6.31
## room_typeHotel room                       0.1546093  0.0355398    4.35
## room_typePrivate room                    -0.3152351  0.0132095  -23.86
## room_typeShared room                     -0.7750745  0.0594558  -13.04
## bathrooms                                 0.1374114  0.0062223   22.08
## beds                                     -0.0632894  0.0042559  -14.87
## accommodates                              0.1677926  0.0036935   45.43
## host_total_listings_count                 0.0007920  0.0000334   23.70
## host_identity_verifiedTRUE                0.0551713  0.0116436    4.74
## is_location_exactTRUE                    -0.0215715  0.0123723   -1.74
## neighbourhood_simplifiedCape Flats       -0.2093225  0.0253631   -8.25
## neighbourhood_simplifiedCBD              -0.2676851  0.0169095  -15.83
## neighbourhood_simplifiedOuter Suburbs    -0.2538057  0.0174141  -14.57
## neighbourhood_simplifiedSommerset        -0.2348116  0.0265401   -8.85
## neighbourhood_simplifiedSouthern Suburbs -0.5363686  0.0370788  -14.47
##                                                      Pr(>|t|)    
## (Intercept)                              < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite             0.000000000000052 ***
## prop_type_simplifiedHouse                               0.034 *  
## prop_type_simplifiedOther                   0.000000000283534 ***
## prop_type_simplifiedVilla                < 0.0000000000000002 ***
## number_of_reviews                        < 0.0000000000000002 ***
## review_scores_rating                        0.000000000279652 ***
## room_typeHotel room                         0.000013683949688 ***
## room_typePrivate room                    < 0.0000000000000002 ***
## room_typeShared room                     < 0.0000000000000002 ***
## bathrooms                                < 0.0000000000000002 ***
## beds                                     < 0.0000000000000002 ***
## accommodates                             < 0.0000000000000002 ***
## host_total_listings_count                < 0.0000000000000002 ***
## host_identity_verifiedTRUE                  0.000002174584338 ***
## is_location_exactTRUE                                   0.081 .  
## neighbourhood_simplifiedCape Flats       < 0.0000000000000002 ***
## neighbourhood_simplifiedCBD              < 0.0000000000000002 ***
## neighbourhood_simplifiedOuter Suburbs    < 0.0000000000000002 ***
## neighbourhood_simplifiedSommerset        < 0.0000000000000002 ***
## neighbourhood_simplifiedSouthern Suburbs < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.599 on 15060 degrees of freedom
##   (83 observations deleted due to missingness)
## Multiple R-squared:  0.497,  Adjusted R-squared:  0.496 
## F-statistic:  743 on 20 and 15060 DF,  p-value: <0.0000000000000002
plot(model5_log_loc)

hist(rstandard(model5_log_loc))

vif(model5_log_loc) #check for multi-colinearity issues
##                           GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified      1.75  4            1.07
## number_of_reviews         1.11  1            1.05
## review_scores_rating      1.02  1            1.01
## room_type                 1.52  3            1.07
## bathrooms                 2.29  1            1.51
## beds                      2.72  1            1.65
## accommodates              3.05  1            1.75
## host_total_listings_count 1.02  1            1.01
## host_identity_verified    1.06  1            1.03
## is_location_exact         1.03  1            1.01
## neighbourhood_simplified  1.14  5            1.01
bptest(model5_log_loc)#check for normality
## 
##  studentized Breusch-Pagan test
## 
## data:  model5_log_loc
## BP = 1348, df = 20, p-value <0.0000000000000002
outlierTest(model5_log_loc)#
##       rstudent        unadjusted p-value           Bonferroni p
## 5788      9.68 0.00000000000000000000042 0.00000000000000000633
## 11271     9.49 0.00000000000000000000263 0.00000000000000003969
## 12543    -9.08 0.00000000000000000011774 0.00000000000000177560
## 14925     8.68 0.00000000000000000419920 0.00000000000006332900
## 10091    -8.07 0.00000000000000073131000 0.00000000001102900000
## 12609     7.80 0.00000000000000636450000 0.00000000009598300000
## 14352    -6.77 0.00000000001290600000000 0.00000019463000000000
## 10830     6.60 0.00000000004262500000000 0.00000064283000000000
## 9160      6.50 0.00000000008405800000000 0.00000126770000000000
## 9329      6.24 0.00000000044699000000000 0.00000674110000000000

The inclusion of location factors again shows improvement in the model(adj r-squared increased from ~48% to ~50%). The only insignificant factor was having an exact location, while the addition of neighborhoods proved significant. The model assumes the Atlantic seaboard neighborhood as its base factor and still assumes apartment as base room type. The residuals vs fitted plot again shows improvement in the heteroscedasticity issues, however this model still fails the BP test. Again listing 12543 shows to be influential and should be removed. We examine the plot of residuals which still seems to maintain a normal shape and VIF examination reveals no multi-collinearity issues.

Adding the effects of having a Cancellation policy and having to pay a Security deposit

We then moved on to examine the effect that paying cancellation policies and security deposits had on Price for 4 nights.We add security_deposit and cancellation_policy to our previous model and removed is_location_ exact as it proved to be an insignificant factor in the previous model.

model6_log_policy <-lm(log(price_4_nights)~
                      prop_type_simplified+ 
                      number_of_reviews+
                      review_scores_rating+
                      room_type+
                        bathrooms+
                      #bedrooms removed as vif>5
                      beds+
                      accommodates+
                        host_total_listings_count+
                        host_identity_verified+
                      neighbourhood_simplified+ 
                      security_deposit+
                      cancellation_policy
                      , #adding if property  location is  
                    data=listings_model)


summary(model6_log_policy)
## 
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
##     review_scores_rating + room_type + bathrooms + beds + accommodates + 
##     host_total_listings_count + host_identity_verified + neighbourhood_simplified + 
##     security_deposit + cancellation_policy, data = listings_model)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.055 -0.340 -0.045  0.301  5.596 
## 
## Coefficients:
##                                                   Estimate  Std. Error t value
## (Intercept)                                     7.48292885  0.07350338  101.80
## prop_type_simplifiedGuest suite                -0.16001048  0.02399096   -6.67
## prop_type_simplifiedHouse                       0.03306866  0.01659402    1.99
## prop_type_simplifiedOther                       0.05112473  0.01566570    3.26
## prop_type_simplifiedVilla                       0.42811617  0.03279836   13.05
## number_of_reviews                              -0.00193359  0.00018346  -10.54
## review_scores_rating                            0.00439598  0.00076438    5.75
## room_typeHotel room                             0.13431339  0.03655813    3.67
## room_typePrivate room                          -0.26830328  0.01617991  -16.58
## room_typeShared room                           -0.53863805  0.09665737   -5.57
## bathrooms                                       0.17475525  0.00864045   20.23
## beds                                           -0.03764611  0.00512684   -7.34
## accommodates                                    0.11787335  0.00486724   24.22
## host_total_listings_count                       0.00081838  0.00003165   25.86
## host_identity_verifiedTRUE                      0.00282839  0.01303958    0.22
## neighbourhood_simplifiedCape Flats             -0.19995945  0.03170062   -6.31
## neighbourhood_simplifiedCBD                    -0.24967636  0.02060003  -12.12
## neighbourhood_simplifiedOuter Suburbs          -0.23735774  0.02055081  -11.55
## neighbourhood_simplifiedSommerset              -0.20533877  0.03259501   -6.30
## neighbourhood_simplifiedSouthern Suburbs       -0.61011373  0.04895935  -12.46
## security_deposit                                0.00002518  0.00000119   21.15
## cancellation_policyluxury_moderate              1.57613024  0.09220257   17.09
## cancellation_policyluxury_super_strict_95       2.03987601  0.22737962    8.97
## cancellation_policymoderate                    -0.03145849  0.01586924   -1.98
## cancellation_policystrict_14_with_grace_period  0.13633584  0.01440133    9.47
## cancellation_policysuper_strict_30              0.59608182  0.03460200   17.23
## cancellation_policysuper_strict_60             -0.22029215  0.10537916   -2.09
##                                                            Pr(>|t|)    
## (Intercept)                                    < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite                    0.00000000002703 ***
## prop_type_simplifiedHouse                                   0.04631 *  
## prop_type_simplifiedOther                                   0.00110 ** 
## prop_type_simplifiedVilla                      < 0.0000000000000002 ***
## number_of_reviews                              < 0.0000000000000002 ***
## review_scores_rating                               0.00000000913457 ***
## room_typeHotel room                                         0.00024 ***
## room_typePrivate room                          < 0.0000000000000002 ***
## room_typeShared room                               0.00000002575127 ***
## bathrooms                                      < 0.0000000000000002 ***
## beds                                               0.00000000000023 ***
## accommodates                                   < 0.0000000000000002 ***
## host_total_listings_count                      < 0.0000000000000002 ***
## host_identity_verifiedTRUE                                  0.82828    
## neighbourhood_simplifiedCape Flats                 0.00000000029534 ***
## neighbourhood_simplifiedCBD                    < 0.0000000000000002 ***
## neighbourhood_simplifiedOuter Suburbs          < 0.0000000000000002 ***
## neighbourhood_simplifiedSommerset                  0.00000000031101 ***
## neighbourhood_simplifiedSouthern Suburbs       < 0.0000000000000002 ***
## security_deposit                               < 0.0000000000000002 ***
## cancellation_policyluxury_moderate             < 0.0000000000000002 ***
## cancellation_policyluxury_super_strict_95      < 0.0000000000000002 ***
## cancellation_policymoderate                                 0.04747 *  
## cancellation_policystrict_14_with_grace_period < 0.0000000000000002 ***
## cancellation_policysuper_strict_30             < 0.0000000000000002 ***
## cancellation_policysuper_strict_60                          0.03660 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.551 on 9818 degrees of freedom
##   (5319 observations deleted due to missingness)
## Multiple R-squared:  0.576,  Adjusted R-squared:  0.575 
## F-statistic:  513 on 26 and 9818 DF,  p-value: <0.0000000000000002
plot(model6_log_policy)

hist(rstandard(model6_log_policy))

bptest(model6_log_policy)
## 
##  studentized Breusch-Pagan test
## 
## data:  model6_log_policy
## BP = 578, df = 26, p-value <0.0000000000000002
vif(model6_log_policy)
##                           GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified      2.09  4            1.10
## number_of_reviews         1.14  1            1.07
## review_scores_rating      1.02  1            1.01
## room_type                 1.54  3            1.07
## bathrooms                 2.86  1            1.69
## beds                      2.80  1            1.67
## accommodates              4.33  1            2.08
## host_total_listings_count 1.03  1            1.02
## host_identity_verified    1.07  1            1.04
## neighbourhood_simplified  1.16  5            1.02
## security_deposit          1.18  1            1.09
## cancellation_policy       1.25  6            1.02
outlierTest(model6_log_policy)
##       rstudent          unadjusted p-value             Bonferroni p
## 11271    10.21 0.0000000000000000000000024 0.0000000000000000000237
## 5788      8.48 0.0000000000000000256780000 0.0000000000002528000000
## 14925     8.46 0.0000000000000000305330000 0.0000000000003005900000
## 10830     7.26 0.0000000000004286400000000 0.0000000042200000000000
## 14641     6.75 0.0000000000157050000000000 0.0000001546200000000000
## 9329      6.65 0.0000000000302060000000000 0.0000002973700000000000
## 11578     5.75 0.0000000093548000000000008 0.0000920979999999999981
## 13345    -5.58 0.0000000242190000000000000 0.0002384400000000000007
## 10091    -5.57 0.0000000258959999999999991 0.0002549399999999999952
## 11579     5.39 0.0000000736369999999999996 0.0007249599999999999708

The inclusion of a cancellation policy and security deposit both seemed to be significant in determining the price as they both have significant p values. The Adjusted R-squared also improved to ~58% (from a previous 50%). This seems to be a fairly good model for predicting price for 4 nights. We examine the residuals and notice great improvement in meeting the underlying assumptions of Multiple- Regression. The fitted vs residual plot is flat and shows a more random nature in comparison to that of our initial model, however, the small p value of the BP test indicates the null hypothesis of homoscedasticity is rejected and the model still has heteroscedasticity issues. The examination of residuals histogram again reveals a far more normal underlying distribution to what we initially started with and that the log transformation was effective in correcting for non-normality. VIF reveals no multi-collinearity issues and the outlier test again reveals possible observations that should be removed for model improvement.

Trying to estimate a better model

While the previous model did prove to be promising, we elected to run an even more extensive model examining all possible factors that we believed could be significant on price. This included factors we had considered previously as well as the introduction of more specific factors such as reviews for value. The original factors were as follows :prop_type_simplified,review_scores_rating,room_type,bathrooms,bedrooms,beds,accommodation, host_total_listings_count, host_identity_verified, host_response_rate, neighborhood_simplified,security_deposit, cancellation_policy,instant_bookable, review_scores_cleanliness, review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value

price_4_nights_full <-lm(log(price_4_nights)~prop_type_simplified+  #examine effects of security deposit and cancellation  policy 
                      number_of_reviews+
                      review_scores_rating+
                      room_type+
                        bathrooms+
                      #bedrooms+ vif > 5 so removed
                      beds+
                      #accommodates+ vif>5
                        #host_total_listings_count+ #removed if location is exact as not significant
                        #host_identity_verified+
                        #host_response_rate+
                      neighbourhood_simplified+ 
                      security_deposit+
                      cancellation_policy+
                        #instant_bookable+
                        #review_scores_cleanliness+  #proved to be insignificant when run
                        #review_scores_checkin+
                        #review_scores_communication+
                        review_scores_location+
                        review_scores_value
                      , 
                    data=listings_model)


summary(price_4_nights_full)
## 
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
##     review_scores_rating + room_type + bathrooms + beds + neighbourhood_simplified + 
##     security_deposit + cancellation_policy + review_scores_location + 
##     review_scores_value, data = listings_model)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.287 -0.307 -0.035  0.285  3.885 
## 
## Coefficients:
##                                                   Estimate  Std. Error t value
## (Intercept)                                     7.34841258  0.08988197   81.76
## prop_type_simplifiedGuest suite                -0.18620047  0.02373552   -7.84
## prop_type_simplifiedHouse                       0.08167630  0.01705393    4.79
## prop_type_simplifiedOther                       0.01841417  0.01629029    1.13
## prop_type_simplifiedVilla                       0.39508295  0.03425863   11.53
## number_of_reviews                              -0.00127227  0.00017436   -7.30
## review_scores_rating                            0.00965609  0.00107531    8.98
## room_typeHotel room                             0.08205166  0.04640275    1.77
## room_typePrivate room                          -0.43907866  0.01791673  -24.51
## room_typeShared room                           -0.89066226  0.12707778   -7.01
## bathrooms                                       0.27686197  0.00839275   32.99
## beds                                            0.04882950  0.00491038    9.94
## neighbourhood_simplifiedCape Flats             -0.23610402  0.03528431   -6.69
## neighbourhood_simplifiedCBD                    -0.24505076  0.02281454  -10.74
## neighbourhood_simplifiedOuter Suburbs          -0.20660367  0.02185832   -9.45
## neighbourhood_simplifiedSommerset              -0.17634329  0.03584058   -4.92
## neighbourhood_simplifiedSouthern Suburbs       -0.37212657  0.06224877   -5.98
## security_deposit                                0.00002603  0.00000142   18.38
## cancellation_policymoderate                     0.03118345  0.01619578    1.93
## cancellation_policystrict_14_with_grace_period  0.12999877  0.01519273    8.56
## cancellation_policysuper_strict_30              0.53131804  0.04957303   10.72
## cancellation_policysuper_strict_60             -0.26249969  0.11155302   -2.35
## review_scores_location                          0.08040592  0.01020453    7.88
## review_scores_value                            -0.11625998  0.00992696  -11.71
##                                                            Pr(>|t|)    
## (Intercept)                                    < 0.0000000000000002 ***
## prop_type_simplifiedGuest suite                  0.0000000000000049 ***
## prop_type_simplifiedHouse                        0.0000017048391570 ***
## prop_type_simplifiedOther                                     0.258    
## prop_type_simplifiedVilla                      < 0.0000000000000002 ***
## number_of_reviews                                0.0000000000003241 ***
## review_scores_rating                           < 0.0000000000000002 ***
## room_typeHotel room                                           0.077 .  
## room_typePrivate room                          < 0.0000000000000002 ***
## room_typeShared room                             0.0000000000026063 ***
## bathrooms                                      < 0.0000000000000002 ***
## beds                                           < 0.0000000000000002 ***
## neighbourhood_simplifiedCape Flats               0.0000000000236357 ***
## neighbourhood_simplifiedCBD                    < 0.0000000000000002 ***
## neighbourhood_simplifiedOuter Suburbs          < 0.0000000000000002 ***
## neighbourhood_simplifiedSommerset                0.0000008823689233 ***
## neighbourhood_simplifiedSouthern Suburbs         0.0000000023585091 ***
## security_deposit                               < 0.0000000000000002 ***
## cancellation_policymoderate                                   0.054 .  
## cancellation_policystrict_14_with_grace_period < 0.0000000000000002 ***
## cancellation_policysuper_strict_30             < 0.0000000000000002 ***
## cancellation_policysuper_strict_60                            0.019 *  
## review_scores_location                           0.0000000000000037 ***
## review_scores_value                            < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.506 on 7727 degrees of freedom
##   (7413 observations deleted due to missingness)
## Multiple R-squared:  0.552,  Adjusted R-squared:  0.551 
## F-statistic:  414 on 23 and 7727 DF,  p-value: <0.0000000000000002
plot(price_4_nights_full)

hist(rstandard(price_4_nights_full))

vif(price_4_nights_full)
##                          GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified     1.81  4            1.08
## number_of_reviews        1.08  1            1.04
## review_scores_rating     2.38  1            1.54
## room_type                1.37  3            1.05
## bathrooms                2.31  1            1.52
## beds                     2.08  1            1.44
## neighbourhood_simplified 1.13  5            1.01
## security_deposit         1.17  1            1.08
## cancellation_policy      1.09  4            1.01
## review_scores_location   1.46  1            1.21
## review_scores_value      2.40  1            1.55
outlierTest(price_4_nights_full)
##       rstudent     unadjusted p-value       Bonferroni p
## 5212     -8.75 0.00000000000000000268 0.0000000000000207
## 10830     7.73 0.00000000000001223700 0.0000000000948480
## 9329      7.13 0.00000000000112090000 0.0000000086878000
## 5009     -6.30 0.00000000031892000000 0.0000024720000000
## 10725    -6.03 0.00000000175070000000 0.0000135690000000
## 8231      5.54 0.00000003047600000000 0.0002362200000000
## 11376    -5.45 0.00000005210700000000 0.0004038800000000
## 7588     -5.21 0.00000018891000000000 0.0014642000000000
## 14701    -5.04 0.00000048433000000000 0.0037541000000000
## 14868     5.04 0.00000048494000000000 0.0037588000000000
# check normality of residuals

Running this model revealed a very good ajusted R-squared of ~60%, however, we noted multi-collinearity issues with accommodation(size) and bedrooms both having a vif>5, we thus removed them and ran the model again. We noted that review scores for cleanliness, communication and check in were not significant and that host related factors(host_total_listings_count,host_identity_verified,host_response_rate) and being instantly bookable were not significant either, so we removed these factors and run the regression again. This reduced the value of the Adjusted R-squared to ~55% which is less than the prior model.

Comparison of models

library(huxtable)

kbl(huxreg(model1,model1t,model1_box,model2_log,model2_box,model3_log_BBB,model3_box_BBB,model4_log_hosts,
model4_box_hosts,model5_log_loc,model6_log_policy,price_4_nights_full))
names model1 model2 model3 model4 model5 model6 model7 model8 model9 model10 model11 model12
1 (Intercept) 2658.4405953067 7.83953624691064 *** 3.43585468456551 *** 8.098283963137 *** 3.46784955276225 *** 7.46645420635465 *** 3.39899017641452 *** 7.41478139492102 *** 3.39349548073157 *** 7.54328548954538 *** 7.48292884644085 *** 7.34841258365636 ***
2 (2594.60195797835) (0.0796778048139288) (0.00906504154132443) (0.0738372371605012) (0.00828087401406495) (0.0657644515527516) (0.00746331015229588) (0.0651583705520031) (0.00745509722289963) (0.0638998099374701) (0.0735033822899816) (0.0898819735320436)
3 prop_type_simplifiedGuest suite -2491.33724008497 ** -0.425330419773082 *** -0.0525630774751917 *** -0.281179331346766 *** -0.0349367202661454 *** -0.231529190040081 *** -0.0297487566697921 *** -0.204122563152797 *** -0.0271801584422914 *** -0.152312181187604 *** -0.160010481333662 *** -0.18620046810099 ***
4 (830.46106947796) (0.0255026844468232) (0.00290147167665601) (0.0237449814428094) (0.00266300863027687) (0.0208183476601531) (0.00236258011399085) (0.0204710137076261) (0.00234219174219315) (0.0202180582677605) (0.0239909593846555) (0.0237355219849422)
5 prop_type_simplifiedHouse 1658.67400770658 ** 0.0977612146241578 *** 0.00589362891853956 ** 0.310307758716861 *** 0.0319064666225461 *** -0.0138638733042324 -0.00335880430850463 * 0.00672580987886029 -0.00162995600799047 0.0297247256952743 * 0.0330686628492796 * 0.0816762951971237 ***
6 (526.605032195879) (0.0161715491042137) (0.00183985696844447) (0.0155347540969922) (0.00174222853486571) (0.0144656908920247) (0.00164164583061751) (0.0142033892162826) (0.00162508126899156) (0.0140206782814926) (0.0165940169470332) (0.0170539268889209)
7 prop_type_simplifiedOther 633.147275812926 -0.00929837023603162 -0.00197944730983314 0.183902464626499 *** 0.0220272921858152 *** 0.0611767355421061 *** 0.00865417978137493 *** 0.0771883231865824 *** 0.010034169176559 *** 0.0866387479649601 *** 0.0511247258787441 ** 0.0184141713900181
8 (531.962503667527) (0.0163360720534474) (0.00185857494608941) (0.0161425207974367) (0.00181038979969453) (0.014245166557147) (0.00161661952128988) (0.0139698495284163) (0.00159836081751772) (0.0137264783020154) (0.0156656982228964) (0.0162902861231909)
9 prop_type_simplifiedVilla 49620.2503668972 *** 1.63447552800511 *** 0.153687864099452 *** 1.66273257694416 *** 0.157140997627402 *** 0.726406004461657 *** 0.0569980603232139 *** 0.725076509688456 *** 0.0568180410646431 *** 0.703300186882409 *** 0.428116172310518 *** 0.395082950885861 ***
10 (1123.39373747912) (0.0344983733126429) (0.00392492222793463) (0.0318818895411535) (0.00357556594441291) (0.0310163985133541) (0.00351991077925571) (0.0304406778403017) (0.00348287120914565) (0.0298822267230871) (0.0327983645531646) (0.0342586295903378)
11 number_of_reviews -33.3801167771904 *** -0.00118673633535017 *** -0.000112100837401775 *** -0.00241185097492915 *** -0.000263233976907366 *** -0.00179495350681289 *** -0.000196252018407267 *** -0.00180694439679867 *** -0.000196831401711996 *** -0.00232295023785362 *** -0.00193358954575242 *** -0.0012722745264221 ***
12 (6.63256562374388) (0.000203679900710476) (0.0000231729120221763) (0.000190071725502894) (0.0000213166157490977) (0.000166276092029035) (0.0000188699216130307) (0.000176435586496053) (0.0000201868837382546) (0.000164156457193156) (0.000183458044900662) (0.000174362305990784)
13 review_scores_rating 45.2149276439987 0.00639120522839432 *** 0.000765672155446871 *** 0.00473701323057481 *** 0.000560090845174015 *** 0.00468252128793722 *** 0.000561700199474217 *** 0.00458976383267998 *** 0.000561808811247495 *** 0.00417586503762622 *** 0.00439597989279097 *** 0.00965608819757964 ***
14 (27.4792037337613) (0.000843860702720971) (0.0000960070667499058) (0.00078071084825284) (0.0000875570162754403) (0.000687845936668773) (0.0000780605242064202) (0.00068533851631602) (0.0000784130300750231) (0.000661372787816754) (0.000764379578568218) (0.0010753050494551)
15 room_typeHotel room -0.0564963545424016 -0.00995479369396976 * 0.11399932423255 ** 0.00865225364839856 * 0.109129934808675 ** 0.00820488948188399 * 0.154609339613853 *** 0.134313386970929 *** 0.0820516572689082
16 (0.0421129564893995) (0.00472298396391067) (0.0368634086461288) (0.00418346151303642) (0.0360936213485086) (0.00412965293637757) (0.0355398438394477) (0.0365581272724676) (0.0464027461451645)
17 room_typePrivate room -0.700469980255449 *** -0.0855355450364339 *** -0.369377156554424 *** -0.0490163544616129 *** -0.348569124958902 *** -0.0471704478853915 *** -0.315235118254419 *** -0.268303284784049 *** -0.439078661524376 ***
18 (0.0143165311649329) (0.00160560437327229) (0.0136383919977057) (0.00154775942099694) (0.0134332580494336) (0.00153696668486975) (0.0132094856694054) (0.0161799097702766) (0.0179167309673096)
19 room_typeShared room -1.16464217584111 *** -0.153447959401989 *** -0.803356553518357 *** -0.115854498439462 *** -0.775050740958292 *** -0.113156799767381 *** -0.775074523559198 *** -0.538638045556679 *** -0.890662262466139 ***
20 (0.069775713737354) (0.00782537262932127) (0.0619125618518909) (0.00702617661234841) (0.0606427605205365) (0.00693844354479059) (0.0594558034128224) (0.0966573653812291) (0.127077775104405)
21 bathrooms 0.152654748561735 *** 0.0141447554016876 *** 0.150596744376025 *** 0.0139389724371136 *** 0.137411391075311 *** 0.174755245747438 *** 0.276861974148033 ***
22 (0.00645252729903706) (0.000732268138209002) (0.0063234736344457) (0.000723500454843494) (0.00622225881717983) (0.00864045001425425) (0.00839275495171724)
23 beds -0.0702999384735502 *** -0.0066828848743994 *** -0.0696247715340496 *** -0.00660187817408571 *** -0.0632893933434176 *** -0.0376461085107749 *** 0.0488294961970899 ***
24 (0.00442263834245379) (0.000501905221770033) (0.00433266872204997) (0.000495722441857562) (0.00425594926531817) (0.00512683529428104) (0.00491038345493)
25 accommodates 0.163129874859234 *** 0.018055752323278 *** 0.164750933980334 *** 0.0181888077785848 *** 0.167792592105737 *** 0.117873350480238 ***
26 (0.0038443260839278) (0.000436275179272194) (0.00376489298847916) (0.000430760361641094) (0.00369346793576996) (0.00486723541125897)
27 host_is_superhostTRUE 0.00598361768086492 -0.000577043114074525
28 (0.0127886223848374) (0.00146321067298363)
29 host_total_listings_count 0.000844020509133167 *** 0.0000739708223228305 *** 0.000791974145422668 *** 0.0008183823076365 ***
30 (0.0000340342536170638) (0.00000389403030607606) (0.0000334196257433734) (0.0000316478130844066)
31 host_identity_verifiedTRUE 0.086556763268096 *** 0.00956931116890247 *** 0.055171264816713 *** 0.00282839290581888
32 (0.0118401432052085) (0.00135469039480408) (0.0116436177552824) (0.0130395755337533)
33 is_location_exactTRUE -0.0215715225897326
34 (0.0123723157834674)
35 neighbourhood_simplifiedCape Flats -0.20932251474412 *** -0.199959446561218 *** -0.23610402431263 ***
36 (0.0253630523138711) (0.0317006245466154) (0.0352843100357869)
37 neighbourhood_simplifiedCBD -0.267685079246393 *** -0.249676361299245 *** -0.245050758564061 ***
38 (0.0169095412834666) (0.0206000256489285) (0.0228145449511962)
39 neighbourhood_simplifiedOuter Suburbs -0.253805711866448 *** -0.237357737864287 *** -0.206603670659701 ***
40 (0.0174141080985606) (0.0205508061534903) (0.0218583221468348)
41 neighbourhood_simplifiedSommerset -0.234811556815801 *** -0.205338769971492 *** -0.176343289916416 ***
42 (0.0265401221782711) (0.0325950141372058) (0.0358405755289053)
43 neighbourhood_simplifiedSouthern Suburbs -0.536368612330183 *** -0.610113731173444 *** -0.372126566839667 ***
44 (0.037078791529449) (0.0489593522993336) (0.0622487661123216)
45 security_deposit 0.0000251828605729947 *** 0.0000260284807343289 ***
46 (0.00000119095316353771) (0.00000141627288372506)
47 cancellation_policyluxury_moderate 1.57613024491602 ***
48 (0.0922025677212688)
49 cancellation_policyluxury_super_strict_95 2.03987601101046 ***
50 (0.227379624341317)
51 cancellation_policymoderate -0.0314584877915848 * 0.0311834514222431
52 (0.0158692429990555) (0.0161957795060771)
53 cancellation_policystrict_14_with_grace_period 0.136335843026511 *** 0.129998774372139 ***
54 (0.0144013272748752) (0.0151927340619825)
55 cancellation_policysuper_strict_30 0.596081818069027 *** 0.531318037554978 ***
56 (0.0346020027403708) (0.0495730321868159)
57 cancellation_policysuper_strict_60 -0.220292151745535 * -0.262499688177884 *
58 (0.105379155889676) (0.111553020887095)
59 review_scores_location 0.0804059206785202 ***
60 (0.0102045324441997)
61 review_scores_value -0.116259979160915 ***
62 (0.0099269609371801)
1.1 N 15164 15164 15164 15164 15164 15082 15082 15081 15081 15081 9845 7751
2.1 R2 0.120999699699964 0.159838544045043 0.124647365151497 0.282813121926888 0.27391205376706 0.45302849453 0.432011276527834 0.475877032312182 0.446782403347269 0.496519241671087 0.576083444143207 0.552239556605096
3.1 logLik -175191.706979002 -17623.301572395 15336.6477329464 -16423.3921662058 16754.1456295263 -14279.7539953668 18539.8906945251 -13957.5184613583 18736.925091094 -13654.5348950004 -8096.00916312769 -5703.4576481468
4.1 AIC 350399.413958003 35262.6031447901 -30657.2954658927 32868.7843324115 -33486.2912590526 28587.5079907336 -37051.7813890502 27949.0369227166 -37439.850182188 27353.0697900008 16248.0183262554 11456.9152962936
.1 *** p < 0.001; ** p < 0.01; * p < 0.05.
#huxreg fits a maximum of 8 observations at a time so we ran two separate ones to compare all models analyzed
#price_4_nights_full was best at 0.58 A-R^2

Comparing all models we see that the highest R squared is given by our second-last model (model6_log_policy), which reads as follows:log(price_4_nights) = 7.483 -1.600e-01 prop_type_simplifiedGuest suite + 3.307e-02 prop_type_simplifiedHouse + 5.112e-02 rop_type_simplifiedOther + 4.281e-01 prop_type_simplifiedVilla -1.934e-03 number_of_reviews + 4.396e-03review_scores_rating + 1.343e-01 room_typeHotel room -2.683e-01 room_typePrivate room -5.386e-01room_typeShared room + 1.748e-01 bathrooms -3.765e-02 beds + 1.179e-01 accommodates + 8.184e-04 host_total_listings_count -2.000e-01 neighbourhood_simplifiedCape Flats -2.497e-01 neighbourhood_simplifiedCBD -2.374e-01 neighbourhood_simplifiedOuter Suburbs -2.053e-01 neighbourhood_simplifiedSommerset -6.101e-01 neighbourhood_simplifiedSouthern Suburbs + 2.518e-05 security_deposit + 1.576e+00 cancellation_policyluxury_moderate + 2.040e+00 cancellation_policyluxury_super_strict_95 -3.146e-02 cancellation_policymoderate + 1.363e-01 cancellation_policystrict_14_with_grace_period + 5.961e-01 cancellation_policysuper_strict_30 -2.203e-01 cancellation_policysuper_strict_60+ error

Where the model assumes apartment property types, Entire home/apt property type, Atlantic Seaboard neighborhood, and no(N/A)cancellation policy as its base hence all estimates of factor slope either correlated to increases or decreases on these base assumptions.

We proceed to run an ANOVA test and find the confidence intervals for our estimated model.The ANOVA reveals that there are significant differences between means of the groups as the p value is less than 0.05

library(kableExtra)

kbl(anova(model6_log_policy))
Df Sum Sq Mean Sq F value Pr(>F)
prop_type_simplified 4 1425.49 356.374 1171.9 0
number_of_reviews 1 22.38 22.383 73.6 0
review_scores_rating 1 19.17 19.172 63.0 0
room_type 3 592.17 197.391 649.1 0
bathrooms 1 1044.84 1044.844 3436.0 0
beds 1 4.26 4.261 14.0 0
accommodates 1 188.46 188.462 619.8 0
host_total_listings_count 1 219.63 219.631 722.3 0
host_identity_verified 1 8.63 8.626 28.4 0
neighbourhood_simplified 5 145.49 29.097 95.7 0
security_deposit 1 157.07 157.070 516.5 0
cancellation_policy 6 229.61 38.269 125.8 0
Residuals 9818 2985.54 0.304 NA NA
kbl(confint(model6_log_policy))
2.5 % 97.5 %
(Intercept) 7.339 7.627
prop_type_simplifiedGuest suite -0.207 -0.113
prop_type_simplifiedHouse 0.001 0.066
prop_type_simplifiedOther 0.020 0.082
prop_type_simplifiedVilla 0.364 0.492
number_of_reviews -0.002 -0.002
review_scores_rating 0.003 0.006
room_typeHotel room 0.063 0.206
room_typePrivate room -0.300 -0.237
room_typeShared room -0.728 -0.349
bathrooms 0.158 0.192
beds -0.048 -0.028
accommodates 0.108 0.127
host_total_listings_count 0.001 0.001
host_identity_verifiedTRUE -0.023 0.028
neighbourhood_simplifiedCape Flats -0.262 -0.138
neighbourhood_simplifiedCBD -0.290 -0.209
neighbourhood_simplifiedOuter Suburbs -0.278 -0.197
neighbourhood_simplifiedSommerset -0.269 -0.141
neighbourhood_simplifiedSouthern Suburbs -0.706 -0.514
security_deposit 0.000 0.000
cancellation_policyluxury_moderate 1.395 1.757
cancellation_policyluxury_super_strict_95 1.594 2.486
cancellation_policymoderate -0.063 0.000
cancellation_policystrict_14_with_grace_period 0.108 0.165
cancellation_policysuper_strict_30 0.528 0.664
cancellation_policysuper_strict_60 -0.427 -0.014

Running the model on the test data

We attempted to run the test data against the predicted values and analyze the Root Mean Squared Error, however this proved problematic as large portions of the test data contain N/A variables which are obscuring the prediction. We would thus need to clean and rename N/A values across all possible variables.This proves difficult as we can not impute missing values with the mean or median as we risk creating unnecessary bias in our model.

rmse_model <- listings_model%>%
              mutate(predictions=predict(model6_log_policy,.))%>%
              summarize(sqrt(sum(exp(predictions)-price_4_nights)**2/n()))%>%
             pull()
rmse_model
## [1] NA
#rmse_test <- listing_test%>%
              #mutate(predictions=predict(model6_log_policy,.))%>%
              #summarize(sqrt(sum(exp(predictions)-price_4_nights)**2/n()))%>%
             #pull()
#rmse_test

Prediction of Price

After a very challenging Applied Statistics with R module, two of us are planning to visit Cape Town for 4 nights over reading week, and we want to stay in an Airbnb. Find Airbnb’s that are apartment with a private room, have at least 10 reviews, and an average rating of at least 90. Use your best model to predict the total cost to stay at this Airbnb for 4 nights. Include the appropriate 95% interval with your prediction. Report the point prediction and interval in terms of price_4_nights. We have also elected for a rental that has 2 beds, 1 bathroom and accommodates 2.We are not interested in the number of host listing counts and would like to not pay a security deposit or cancellation fee as we are students. After consulting our Cape Town expert (Jason) we would like to stay in the Atlantic Seaboard neighborhood as he advises it has the best views and shops.

predict = 7.483 -0.001934*(10) + 0.004396*(90) -0.2683*(1)+  0.1748*(1)  -0.03765*(2) + 0.1179*(2) # substituted criteria into equation deleted all values that did not apply or will null value like zero security deposit

Price= exp(predict)
Price
## [1] 2769
exp(0.5514)
## [1] 1.74
# SE for confidence intervals is 0.5514

upper= exp(predict+0.5514*1.96)
lower=exp(predict-0.5514*1.96)

ci= merge(upper,lower)#sorry Kostis for the merge ;)
names(ci)[1] = "97.5%"
names(ci)[2] = "2.5%"

kbl(ci) 
97.5% 2.5%
8160 940

The model estimates that it will cost $2769.17(Although we believe currencies are actually reported in ZAR but recorded in dollars) with a 95% confidence interval [939.69; 8160.36]. This is a large interval considering the sample size however it is not that unexpected as the inverse log value for standard error is

Suggestion for further work

There are some methods that can be implemented to improve the quality of this model. Firstly, the method of obtaining data. There are multitudes of missing values across the entire data set as well as many incorrect observation in variables such as a property that was listed in the USA but appeared on this data set.This brings suspicion on the quality and reliability of data supplied by AirBnB. One significant problem we mentioned was heteroscedasticity, which we didn’t manage to remove from our final model. This could have been solved by removing outliers and influential points identified for the model, or examining a different transformation than the ones we did.

Another issue comes from the fact that the actual property prices are decided by the hosts themselves, not based on any predefined criteria i.e hosts just set the price they want. This can have a large impact on the outcome of this analysis, as things like pre-definition can influence the overall outcome.

Further analysis includes obtaining information on possible influential factors that are not included in this dataset. Such as whether the property is new or old, whether there is an elevator/lift and, as crime is a major issue in South Africa, whether some sort of security is included in a particular listing.